Selective and Features based Adversarial Example Detection

03/09/2021 ∙ by Ahmed Aldahdooh, et al.

Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations crafted to generate Adversarial Examples (AEs) that are imperceptible to humans and cause DNNs to misclassify them. Many defense and detection techniques have been proposed. The state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameter overhead, are time-consuming, or introduce latency at inference time. To trade off these factors, we propose a novel unsupervised detection mechanism that uses selective prediction, processing of model layer outputs, and knowledge transfer concepts in a multi-task learning setting. It is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves results comparable to state-of-the-art methods against the tested attacks in the white-box scenario and better results in the black-box and gray-box scenarios. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for the CIFAR-10 dataset.

1 Introduction

Deep learning (DL) has achieved remarkable advances in different fields of human life, especially in computer vision tasks like object detection, image classification [30, 64, 59], surveillance [34], and medical imaging [63]. Despite that, DL models have been found to be vulnerable to adversaries [69, 19]. In image classification models, for instance, adversaries can generate AEs by adding small perturbations to an input image, imperceptible to humans and devices, that cause the DL model to misclassify it. Such a potential threat affects security-critical DL-based applications [24] such as self-driving cars.

Adversaries can generate AEs under white-box, black-box, and gray-box attack scenarios [1, 25]. In the white-box scenario, the adversary knows everything about the DL model, including its inputs, outputs, architecture, and weights. Hence, he is guided by the model gradient to generate AEs by solving an optimization problem [19, 33, 49, 10, 45]. In the black-box scenario, the adversary knows nothing about the model but leverages the transferability property [55] of AEs and the input content; he can craft small perturbations that are harmonious with an input image [12, 14, 68, 29]. In the gray-box scenario, the adversary knows only the input and the output of the model; hence, he tries to substitute the original model with an approximated model and then uses its gradient, as in the white-box scenario, to generate AEs.

Researchers have paid attention to this threat, and several methods have been proposed to detect or defend against AEs. Defense techniques such as adversarial training [19, 45, 73, 70], feature denoising [74, 6, 37], pre-processing [50, 58], and gradient masking [56, 54, 22, 51] try to make the model robust against the attacks and let it correctly classify the AEs. On the other hand, other works focus on detecting AEs. Detection approaches can be categorised into supervised and unsupervised detection [7]. In supervised detection, detectors use AEs in the training process [44, 15, 41]. The main limitation of this approach is that it requires knowledge of existing attacks and might not be robust against new/unknown attacks. In unsupervised detection, the focus of this paper, detectors rely on the fact that AEs are features [27]. In the prediction process, AEs trigger different neurons in different layers, and the effect of these triggers increases/propagates as the network goes deeper. Methods like Feature Squeezing (FS) [75], denoiser-based detection [47], and DNN invariant analysis [43] have been proposed. Although these approaches achieve promising results, they may have one or more limitations: not performing well with some known attacks [75], being broken by attackers [3, 9], inconsistent baseline detector performance [7], increased model parameter overhead [42], being time-consuming [43], or introducing latency [17] at inference time. More discussion about detection techniques is provided in Section 2.1.

Figure 1: High-level model architecture. The input sample is passed to the CNN model, and the outputs of its last N layers are processed by the detector classifiers. SFAD yields prediction and selective probabilities to determine the prediction class of the input sample and whether it is adversarial or not.

DL uncertainty is one of the main methods that has been used to determine whether an input sample belongs to the training manifold. The uncertainty is usually measured by adding randomness to the model using the Dropout [67] technique. It is found that the predictions of clean samples do not change, while those of AEs do. Feinman et al. [16] proposed the Bayesian Uncertainty (BU) metric, which uses Monte Carlo dropout to estimate the uncertainty, to detect those AEs that are near the class manifolds, while Smith et al. [65] used a mutual information method. The prediction risk of these methods is higher than that of other uncertainty methods [18].

In this paper, in order to mitigate the aforementioned limitations, we integrate four concepts in a multi-task setting to propose a novel AE detector that has no knowledge of AEs (unsupervised detection). Selective prediction [18], analysis of the representative data of the last N layers of the DNN, knowledge transfer, and ensemble prediction concepts are explored. We name the proposed method Selective and Feature based Adversarial Detection (SFAD). The high-level architecture of SFAD is illustrated in Figure 1. We remind the reader that unsupervised detection relies on the fact that AEs are features. Hence, 1) SelectiveNet [18] is an alternative to the dropout technique. According to [18], SelectiveNet measures the uncertainty with less risk. As a novel uncertainty measure, it is used in this paper to explore/study its impact on detecting adversarial examples. It aims at rejecting anomalous inputs, not necessarily adversarial ones, from being predicted, since they do not belong to the input data distribution and do not lie on its manifold. In this paper, we treat AEs as special anomalous data and assume that the detection capability increases as more perturbations are added to the input sample. To the best of the authors' knowledge, the SelectiveNet concept [18] has not been used in adversarial attack detection models. 2) Using the representative data of the last N layers of the DNN model is not new; it has been used in [46, 66, 41] without any processing. In this paper, we process the representative data of the last N layer outputs to build N CNNs, with different processing techniques like up/down-sampling, auto-encoders [72, 38], noise addition [36, 39, 40], and bottleneck layer addition [26] that make the representative data of the last layers more unique to the input data distribution. 3) The third concept, knowledge transfer, distills the outputs/knowledge that come from the N CNNs as input to build the last CNN. This step has a great impact on reducing the effect of white-box attacks. 4) Ensemble AE detection is then utilized, since we have predictions that come from the DNN and from the detector.

A prototype of SFAD is tested under white-box, black-box, and gray-box attacks on the MNIST [35], CIFAR10 [31], and ImageNet NIPS2017 [32] datasets. The experimental results show that the proposed method can detect AEs with an accuracy of at least 89.8% (many at 99%) for all tested attacks, except for the Projected Gradient Descent (PGD) attack [45] with at least 65% detection accuracy on average. The proposed method is shown to be robust against HCAs [8] (100% on MNIST, 56% on CIFAR10). We follow the steps of [66] and set our thresholds to reject at most 10% of clean images. Moreover, comparisons with state-of-the-art methods are presented. Hence, our key contributions are:


  • We propose a novel unsupervised model for adversarial prediction that integrates uncertainty/selective based prediction, feature enhancement, and knowledge distillation in a multi-task learning setting. In the proposed model, the classification and the detection processes share information and work in parallel as one unit to provide the prediction and adversarial-status labels; hence no inference latency is introduced.

  • We analyse the representative data of the last N layers as a key point in presenting robust features of the input data. We provide an ablation study of the impact of the proposed feature processing.

  • The proposed model prototype proves the concept of the approach and leaves the door open for future work to find the best layers and the best CNN combinations to build the detector.

  • Unlike most detection methods, the implemented prototype shows that it is fully robust on MNIST and partially robust on CIFAR10 when attacked with HCAs. For instance, the Local Intrinsic Dimensionality (LID) method [44] reported very high detection accuracy on the tested attacks, but fails on HCAs [3, 8].

2 Related work

2.1 Detection methods

Detection techniques can be classified, according to the presence of AEs in the detector learning process, into supervised and unsupervised techniques [7]. In supervised detection, detectors include AEs in the learning process. Several approaches exist in the literature. In the feature-based approach [52, 21, 48, 41], detectors use clean and adversarial inputs to build their classifier models from scratch or by using the representative layer outputs of a DNN model. For instance, in [41], the detector quantizes the last ReLU activation layer of the model and builds a binary Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel. As reported in [41], this detector is not robust enough and has not been tested against strong attacks like the Carlini-Wagner (CW) attacks. The work in [21] adds a new adversarial class to the Neural Network (NN) model and trains the model from scratch with clean and adversarial inputs; this architecture reduces the model accuracy [21]. In the statistical-based approach [16, 44], detectors perform statistical measurements to define the separation between clean and adversarial inputs. In [16], kernel-density estimation, Bayesian-uncertainty NN, and combined models are introduced. The kernel-density feature is extracted from clean samples and AEs in order to identify AEs that are far away from the data manifold, while the Bayesian-uncertainty feature identifies the AEs that lie in low-confidence regions of the input space. The LID method is introduced in [44] as the distance distribution of the input sample to its neighbors to assess the space-filling capability of the region surrounding that input sample. The works in [8, 3] showed that these methods can be broken. Finally, in the network-invariant approach [48, 15], the differences in neuron activation values between clean input samples and AEs are learned to build a binary NN detector. The main limitation of this approach is that it requires prior knowledge about the attacks and hence might not be robust against new/unknown attacks.

On the other hand, in unsupervised detection, detectors are trained only with clean images to identify the AEs. This is also known as the prediction-inconsistency approach, since it depends on the fact that AEs might not fool every NN model. That is because the input feature space is limited, and the adversary always takes this as an advantage to generate the AEs. Hence, unsupervised detectors try to reduce this limited input feature space available to adversaries. To accomplish this, many approaches have been presented in the literature. The FS approach [75] measures the distance between the predictions of the input and of the same input after squeezing; the input is deemed adversarial if the distance exceeds a threshold. The work in [75] squeezes out unnecessary input features by reducing the color bit depth of each pixel and by spatially smoothing the input. As reported in [75], it does not perform well with some known attacks like the Fast Gradient Sign Method (FGSM). Instead of squeezing, the denoising-based approach, like MagNet [47], measures the distances between the predictions of input samples and of denoised/filtered input samples. It was found in [9, 37] that MagNet can be broken and does not scale to large images. Recently, a network/model-invariant approach was introduced [43]. It proposes a Neural-network Invariant Checking (NIC) method that builds a set of models for individual layers to describe the provenance and the activation value distribution channels. It was observed that AEs affect these channels. The provenance channel describes the instability of the set of activated neurons in the next layer when small changes are present in the input sample, while the activation value distribution channel describes the changes in the activation values of a layer. The reported performance of this method showed its superiority over other state-of-the-art models, but other works reported that the performance of baseline detectors is not consistent [7], that it increases the model parameter overhead [42], is time-consuming [43], and increases the latency at inference time [17].
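To make the prediction-inconsistency idea concrete, the following is an illustrative Python sketch of the feature-squeezing test from [75] (it is not part of SFAD); the bit depth, smoothing window, and threshold are placeholder assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import median_filter

def bit_depth_reduce(x, bits=4):
    # Quantize pixel values in [0, 1] to 2**bits levels.
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def fs_is_adversarial(model, x, threshold):
    # x: a batch of images with shape (n, H, W, C), values in [0, 1].
    p = model.predict(x)
    p_bits = model.predict(bit_depth_reduce(x))
    p_smooth = model.predict(median_filter(x, size=(1, 2, 2, 1)))  # 2x2 spatial smoothing
    # L1 distance between the prediction on the input and on its squeezed versions.
    score = np.maximum(np.abs(p - p_bits).sum(axis=1),
                       np.abs(p - p_smooth).sum(axis=1))
    return score > threshold  # flagged as adversarial if the distance is too large
```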

2.2 Selective Prediction

Selective prediction models are one way to reject anomalous inputs. Anomalous inputs are not necessarily adversarial attacks; an input might simply not belong to the distribution of the input space. We can consider adversarial attacks as a special use case of anomalous inputs. Most anomaly prediction models depend on building auto-encoders [61, 79, 80], in which the reconstruction loss is used as a score to selectively reject inputs. Generative Adversarial Networks (GANs) [20] are another approach for detecting anomalies, in which a generator and a discriminator are built: the generator learns to generate data similar to the training data, and the discriminator is trained to discriminate the unreal data that comes from the generator from the real training data. Another approach for anomaly prediction is the probabilistic approach [18], in which a probabilistically-calibrated selective classifier is built using any certainty estimation function for a given NN model. In [18], a threshold over the selective confidence is obtained by end-to-end optimizing both classification and rejection simultaneously in a multi-task setting. To the best of the authors' knowledge, this approach has not been considered in adversarial attack detection models.

3 Selective and Feature based Adversarial Detection (SFAD) Method

Let $\mathcal{X}$ be an input space, e.g. images, and $\mathcal{Y}$ a label space, e.g. classification labels. Let $P(\mathcal{X}, \mathcal{Y})$ be the data distribution over $\mathcal{X} \times \mathcal{Y}$. A model $f : \mathcal{X} \rightarrow \mathcal{Y}$ is called a prediction function, and $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+}$ is a given loss function. A labeled set $S_m = \{(x_i, y_i)\}_{i=1}^{m} \subseteq \mathcal{X} \times \mathcal{Y}$ is sampled i.i.d. from $P(\mathcal{X}, \mathcal{Y})$, where $m$ is the number of training samples. According to [18], the true risk of the prediction function $f$ w.r.t. $P$ is $R(f) = E_{P(\mathcal{X}, \mathcal{Y})}[\ell(f(x), y)]$, while the empirical risk of the prediction function is $\hat{r}(f \mid S_m) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)$.

3.1 Detector design

It is believed that the last N layers of the DNN have potential for detecting and rejecting AEs [46, 66]. In [4] and [46], only the last layer is utilized to detect AEs. At this very high level of representation, AEs are indistinguishable from samples of the target class. This observation was improved upon when Deep Neural Rejection (DNR) [66] used the last three layers to build SVM classifiers with RBF kernels. Unlike other works, in this work, 1) the representative outputs of the last N layers are processed as features. In the aforementioned methods, the representative outputs of the last layers are not processed, and basically the detectors represent another approximation of the baseline classifier, which is considered a weak point. 2) Multi-Task Learning (MTL) is used. MTL has the advantage of combining related tasks with one or more loss function(s), and it generalizes better, especially with the help of auxiliary tasks. For more details about MTL, please refer to these recent review papers [60, 71]. 3) The selective prediction concept is utilized. In order to build safe DL models, prediction uncertainties have to be estimated, and a rejection mechanism to control the uncertainty has to be identified as well. The work in [18] introduced a model named SelectiveNet. It is a three-headed network for end-to-end learning of selective classification models. It introduces a selective loss function that optimizes a specified coverage slice using a variant of the interior point optimization method [57]. In this section, the Selective and Feature based Adversarial Detection (SFAD) method is described. As depicted in Figure 2, SFAD consists of two main blocks (in grey): the selective AEs classifiers block and the selective knowledge transfer classifier block. Besides the DNN prediction, the two blocks output 1) the detector prediction probabilities and 2) the selective probabilities. The detection blocks (in red) take these probabilities to identify the adversarial status of an input $x$.
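As an illustration, the following minimal Keras sketch shows one way to expose the last N layer outputs of a trained baseline model so that they can feed the detector classifiers; the helper name and the assumption that the softmax is the very last layer are ours, not from the paper.

```python
import tensorflow as tf

def last_n_layer_outputs(baseline_model: tf.keras.Model, n: int = 3) -> tf.keras.Model:
    # Skip the softmax layer itself and take the n layers preceding it.
    feature_layers = baseline_model.layers[-(n + 1):-1]
    return tf.keras.Model(inputs=baseline_model.input,
                          outputs=[layer.output for layer in feature_layers])

# feature_extractor(x) now returns [x_1, x_2, x_3]: one tensor per selective AEs classifier.
feature_extractor = last_n_layer_outputs(baseline_model, n=3)
```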

Figure 2: Detector architecture. The last N representative layer outputs of the DNN are used to build the selective AEs classifiers. The confidence outputs of these classifiers are concatenated to form the input of the selective knowledge transfer classifier. The selective confidences are used for selective detection. The knowledge transfer confidence is used for confidence detection and, together with the DNN prediction, for ensemble detection. The total detection is the ensemble of the three detection modules.

3.2 Selective AEs classifiers block

a) Prediction Task: Unlike the works in [46, 66], we process the representative outputs of the last N layer(s) in different ways in order to make clean input features more unique. This limits the feature space that the adversary uses to craft the AEs [75, 47]. Each of the last N layer outputs has its own feature space, since the perturbation propagation becomes clearer as the DNN model goes deeper. That makes each of the classifiers trained with a different feature space; hence, combining them and increasing N will enhance the detection process. The aim of this block is to build N individual classifiers. Figure 3 shows the architecture of one classifier. The input of each classifier is one or more of the last N layers' representative outputs. For simplicity and as recommended in [66], we set $N = 3$ in the implemented prototype, and each individual layer output is assigned to one classifier, as shown in Figure 3. Let the last three representative layer outputs of the baseline DNN for an input $x$ be $x^{(1)}$, $x^{(2)}$, and $x^{(3)}$, respectively. Each $x^{(i)}$ is individually the input of one NN, whose prediction head is denoted $f_i$. Each classifier represents a function on a distribution over $\mathcal{X}^{(i)} \times \mathcal{Y}$ and optimizes the loss function

$$\hat{r}(f_i \mid S_m) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(f_i(x_j^{(i)}), y_j\big). \qquad (1)$$
Figure 3: Selective AEs detection classifier’s architecture.

As depicted in Figure 3, each selective classifier consists of different processing blocks: an auto-encoder block, an up/down-sampling block, a bottleneck block, and a noise block. These blocks aim at producing distinguishable features for input samples so that the detector recognises the AEs efficiently. Auto-encoders. Auto-encoders are widely used as a reconstruction tool, and their loss is used as a score for different tasks; for instance, it is used in the AE detection process in [47]. It is believed that AEs give a higher reconstruction loss than clean images. This process is also known as an attention mechanism [77, 78], and it is used to focus on a better representation of input features, especially in shallow classifiers.

Up/down-sampling. Up-sampling and down-sampling are used in different deep classifiers [72, 38]. The aim of down-sampling, a.k.a. pooling layers in NNs, is to gather the global information of the input signal. Hence, if we consider the clean input signal as a signal that carries global information, expand this global information by bilinear up-sampling, and then down-sample it by average pooling, we measure the ability to reconstruct the global information of the input signal. Besides, this process can be seen as a use case of the reconstruction process.

Noise. Adding noise has a potential effect in making NNs more robust against AEs, and it has been used in many defense methods [36, 39, 40]. In this work, we add a branch in the classifier that adds small Gaussian noise to the input signal before and after the auto-encoder block. Then, the noised and clean input features are concatenated before the bottleneck block.

Bottleneck. The bottleneck block [26] consists of three convolutional layers: 1×1, 3×3, and 1×1 convolutions. The bottleneck name comes from the fact that the 3×3 convolutional layer is squeezed between the 1×1 convolutional layers. It is mainly designed for efficiency purposes, but according to [77, 76] it is very effective in building shallow classifiers, which helps obtain a better representation of the input signal.

b) Selective Task: The aim of this task is to train the prediction task with the support of selective prediction/rejection, as shown in Figure 3. Selective rejection was originally designed for non-adversarial anomalous inputs. In this paper, we consider AEs as a special case of anomalies, and we follow the selective model proposed in [18]. For more details about SelectiveNet, readers are referred to [18]. It optimizes the loss function

$$\mathcal{L}_{(f_i, g_i)} = \hat{r}_\ell(f_i, g_i \mid S_m) + \lambda \, \Psi\big(c - \hat{\phi}(g_i \mid S_m)\big), \qquad \Psi(a) = \max(0, a)^2, \qquad (2)$$

where $f_i$ is a prediction function, $g_i$ is a selection function for $f_i$, $\hat{r}_\ell(f_i, g_i \mid S_m)$ is the empirical selective risk, $\hat{\phi}(g_i \mid S_m)$ is the empirical coverage, $c$ is the target coverage, $\lambda$ is a hyper-parameter controlling the relative importance of the constraint, and $\Psi$ is a quadratic penalty function.
As depicted in Figure 3, the input of the selective task is the representative output of the last layer of the prediction task. The selective task architecture is simple: it consists of one dense layer with ReLU activation and batch normalization (BN) layers, followed by a special Lambda layer that divides the output of the BN layer by 10, and finally one output dense layer with sigmoid activation.

Figure 4: Selective knowledge transfer detection classifier’s architecture.

c) Auxiliary task: In MTL models [60, 71], the auxiliary task mainly helps in generalizing the prediction task. Most MTL models focus on one main task, and the other/auxiliary tasks must be related. Our main task in the classifier is to train a selective prediction for the input features; in order to optimize this task, the low-level features have to be accurate for both the prediction and selective tasks and must not overfit to one of them. Hence, the original prediction process is considered as an auxiliary task. The auxiliary network, $h_i$, is trained on the same prediction task assigned to $f_i$ with the standard loss function

$$\mathcal{L}_{h_i} = \hat{r}(h_i \mid S_m) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(h_i(x_j^{(i)}), y_j\big). \qquad (3)$$

d) The overall loss: Finally, we optimize the overall loss function as a linear combination of the selective prediction loss and the auxiliary loss, Eq. (2) and (3), as

$$\mathcal{L}_i = \alpha \mathcal{L}_{(f_i, g_i)} + (1 - \alpha) \mathcal{L}_{h_i}. \qquad (4)$$

Studying the value of $\alpha$ is out of the scope of this paper, but other task-balancing methods may be applied, like uncertainty weighting [28], GradNorm [13], DWA [38], DTP [23], and MGDA [62].
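For concreteness, the following is a minimal TensorFlow sketch of the losses in Eq. (2)-(4), written for one selective AEs classifier and assuming one-hot labels; the default coverage, λ, and α values are placeholders rather than the paper's settings (see Section 4.1 for the values used there).

```python
import tensorflow as tf

def selective_loss(y_true, f_out, g_out, coverage=0.9, lam=32.0):
    """Eq. (2): empirical selective risk plus a quadratic penalty that pushes
    the empirical coverage of the selection head g towards the target coverage c."""
    ce = tf.keras.losses.categorical_crossentropy(y_true, f_out)          # per-sample loss
    emp_coverage = tf.reduce_mean(g_out)                                   # phi_hat(g)
    selective_risk = tf.reduce_sum(ce * tf.squeeze(g_out, -1)) / (
        tf.reduce_sum(g_out) + 1e-12)                                      # r_hat(f, g)
    penalty = lam * tf.square(tf.maximum(0.0, coverage - emp_coverage))    # Psi(c - phi_hat)
    return selective_risk + penalty

def overall_loss(y_true, f_out, g_out, h_out, alpha=0.5, coverage=0.9, lam=32.0):
    """Eq. (4): linear combination of the selective loss and the auxiliary loss (Eq. 3)."""
    aux = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, h_out))
    return alpha * selective_loss(y_true, f_out, g_out, coverage, lam) + (1.0 - alpha) * aux
```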

3.3 Selective knowledge transfer block

The idea behind this block is that each set of inputs (the confidence values of the classes produced by the i-th classifier) is considered a special feature of the clean input, and combining different sets of these features makes the features more robust. Hence, we transfer this knowledge of clean inputs to the classifier. Besides, at inference time, we believe that an AE will generate a different distribution of confidence values, and if it is able to fool one classifier (a selective AEs classifier), it may not fool the others.

a) Prediction Task: As shown in Section 3.2, the output of the prediction task of each selective AEs classifier is a vector of class prediction values. These confidence values are concatenated with the outputs of the other prediction tasks to form the input of the selective knowledge transfer block, as illustrated in Figure 4. Its classifier consists of one or more dense layer(s) and yields confidence values for the classes. The prediction task represents a function $f_t$ on a distribution over the concatenated confidences and $\mathcal{Y}$. Hence, it optimizes the following loss function

$$\hat{r}(f_t \mid S_m) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(f_t(z_j), y_j\big), \qquad (5)$$

where $z_j$ denotes the concatenation of the classifiers' confidence values for the $j$-th sample.

b) Selective Task: As depicted in Figure 4, the selective task is also integrated in the knowledge transfer classifier to selectively predict/reject AEs, with the following loss function:

$$\mathcal{L}_{(f_t, g_t)} = \hat{r}_\ell(f_t, g_t \mid S_m) + \lambda \, \Psi\big(c - \hat{\phi}(g_t \mid S_m)\big). \qquad (6)$$

c) Auxiliary task: Similar to the selective AEs classifiers, the knowledge transfer classifier has an auxiliary network, $h_t$, that is trained on the same prediction task assigned to $f_t$ with the standard loss function

$$\mathcal{L}_{h_t} = \hat{r}(h_t \mid S_m) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(h_t(z_j), y_j\big). \qquad (7)$$

d) The overall loss: Finally, we optimize the overall loss function of the knowledge transfer classifier as a linear combination of its selective prediction loss and its auxiliary loss, Eq. (6) and (7), as

$$\mathcal{L}_t = \alpha \mathcal{L}_{(f_t, g_t)} + (1 - \alpha) \mathcal{L}_{h_t}. \qquad (8)$$

3.4 Detection process

As depicted in Figure 2, the outputs of each selective AEs classifier and of the selective knowledge transfer classifier are the confidence values for the classes, in addition to one confidence value from each selective module. The baseline DNN model outputs its own class prediction.

  1. Thresholds: In the testing process, the following thresholds are defined: a confidence threshold and selective thresholds for the selective modules. Following the steps in [66], we select our thresholds at a level at which at most 10% of clean inputs are rejected. From each selective AEs classifier a confidence threshold is calculated and, then, from the selective knowledge transfer classifier a confidence threshold is calculated as well; these are combined into the final confidence threshold. Moreover, from each selective AEs classifier and from the selective knowledge transfer classifier, selective thresholds are calculated, respectively.

  2. Confidence detection: the input is flagged as adversarial if the confidence of the knowledge transfer classifier's prediction falls below the confidence threshold, and is not flagged otherwise.

  3. Selective detection: the input is flagged as adversarial if the selective confidence of any selective AEs classifier, or of the selective knowledge transfer classifier, falls below its selective threshold, and is not flagged otherwise.

  4. Ensemble detection: the input is flagged as adversarial if the class predicted by the knowledge transfer classifier disagrees with the class predicted by the baseline DNN, and is not flagged otherwise.

  5. Final detection: The input sample is adversarial if it is detected by the confidence, selective, or ensemble detection process. A simplified sketch of these rules is given below.
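The following NumPy sketch illustrates one plausible reading of these rules; the exact way the per-classifier confidence thresholds are combined is simplified here, and all variable names are ours.

```python
import numpy as np

# Assumed inputs:
#   conf_t   : (n, k) class confidences of the selective knowledge transfer classifier
#   sel_list : list of (n,) selective scores, one per selective AEs classifier
#   sel_t    : (n,)  selective score of the knowledge transfer classifier
#   dnn_pred : (n,)  argmax predictions of the baseline DNN
# Thresholds are set on clean data so that at most 10% of clean inputs are rejected.

def fp_threshold(clean_scores, fp_rate=0.10):
    # The value below which fp_rate of the clean scores fall.
    return np.percentile(clean_scores, 100 * fp_rate)

def detect(conf_t, sel_list, sel_t, dnn_pred, th_conf, th_sel_list, th_sel_t):
    confidence_det = np.max(conf_t, axis=1) < th_conf            # confidence detection
    selective_det = sel_t < th_sel_t                             # selective detection
    for sel_i, th_i in zip(sel_list, th_sel_list):
        selective_det |= sel_i < th_i
    ensemble_det = np.argmax(conf_t, axis=1) != dnn_pred         # ensemble detection
    return confidence_det | selective_det | ensemble_det         # final detection
```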

4 Experimental Results and Discussion

4.1 Experimental settings

4.1.1 Datasets

The proposed prototype is evaluated on CNN models trained on two popular datasets: MNIST [35] and CIFAR10 [31]. For the ImageNet test, we prove the concept of the paper on the small subset of ImageNet defined in the NIPS'2017 adversarial attacks and defenses competition [32]. MNIST is a hand-written digit recognition dataset with 70000 images (60000 for training and 10000 for testing) and ten classes, and CIFAR10 is an object recognition dataset with 60000 images (50000 for training and 10000 for testing) and ten classes. The ImageNet NIPS 2017 dataset has 5000 images with 1000 classes for training and 1000 images for testing.

4.1.2 Baseline classifiers

For the baseline models, two CNN models are trained: one for MNIST and one for CIFAR10. For MNIST, we trained a 6-layer CNN with 98.73% accuracy, while for CIFAR10 we trained an 8-layer CNN with 89.11% accuracy. The classifier architectures for MNIST and CIFAR10 are shown in Table 1 and Table 2, respectively. For ImageNet, we used a ResNet50v2 [26] CNN with 76% accuracy.

Layer Description
Conv2D + ReLU 32 filters ()

Conv2D + ReLU + Max Pooling() 32 filters ()
Conv2D + ReLU 64 filters ()
Conv2D + ReLU + Max Pooling() 64 filters ()
Dense + ReLU + Dropout () 256 units
Dense + ReLU 256 units
Softmax 10 classes
Table 1: MNIST baseline classifier architecture
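For reference, a minimal Keras sketch of the MNIST baseline in Table 1 could look as follows; the 3x3 kernels, 2x2 pooling, and 0.5 dropout rate are assumed values, since they are not listed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mnist_baseline():
    # 6-layer CNN following Table 1; kernel/pool sizes and dropout rate are assumptions.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
```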
Layer Description
Conv2D + BatchNorm + ReLU 64 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 64 filters ()
Conv2D + BatchNorm + ReLU 128 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 128 filters ()
Conv2D + BatchNorm + ReLU 256 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 256 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 512 filters ()
Dense 512 units
Softmax 10 classes
Table 2: CIFAR10 baseline classifier architecture

In order to evaluate the proposed prototype against gray-box attacks, we consider that the adversaries know the training dataset and the model outputs but do not know the baseline model architectures. Hence, Table 3 and Table 4 show the two alternative architectures for the MNIST and CIFAR10 classifiers. For MNIST, the classification accuracies are 98.37% and 98.69% for Model #2 and Model #3, respectively, while for CIFAR10 the classification accuracies are 86.93% and 88.38% for Model #2 and Model #3, respectively.

Model Layer Description

Model #2

Conv2D + BatchNorm + ReLU 64 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 64 filters ()
Dense + BatchNorm + ReLU + Dropout () 128 units
Softmax 10 classes

Model #3

Conv2D + ReLU + Max Pooling() 32 filters ()
Conv2D + ReLU + Max Pooling() 64 filters ()
Dense + ReLU + Dropout () 256 units
Dense + ReLU 256 units
Softmax 10 classes
Table 3: MNIST classifiers architectures for gray-box setting
Model Layer Description

Model #2

Conv2D + BatchNorm + ReLU 32 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() 32 filters ()
Conv2D + BatchNorm + ReLU 64 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() 64 filters ()
Conv2D + BatchNorm + ReLU 128 filters ()
Conv2D + BatchNorm + ReLU + Max Pooling() + Dropout () 128 filters ()
Dense + BatchNorm + ReLU + Dropout () 512 units
Softmax 10 classes

Model #3

Conv2D + ReLU 64 filters ()
Conv2D + ReLU + Max Pooling() 64 filters ()
Conv2D + ReLU 128 filters ()
Conv2D + ReLU + Max Pooling() 128 filters ()
Dense + ReLU + Dropout () 256 units
Dense + ReLU 256 units
Softmax 10 classes
Table 4: CIFAR10 classifiers architectures for gray-box setting

4.1.3 SFAD Settings

As described in Section 3, Figure 3, and Figure 4, we introduce here the implementation details for the detector components.

4.1.4 Selective AEs classifiers block

It consists of autoencoder, up/down sampling, bottleneck, and noise layers. Each has the following architecture:


Autoencoder. As shown in Figure 5, the encoder applies a sequence of convolutional layers, and the decoder symmetrically restores the number of filters. Finally, to maintain the input sample characteristics present before autoencoding, the input is added/summed to the output of the autoencoder.

Figure 5: Autoencoder architecture
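A minimal Keras sketch of this block is given below; the encoder/decoder filter counts (64/32/16) are illustrative placeholders, since the exact values are not given above, and the final addition implements the described skip connection.

```python
from tensorflow.keras import layers

def autoencoder_block(x):
    # Encoder: 3x3 convolutions with assumed filter counts.
    e = x
    for filters in (64, 32, 16):
        e = layers.Conv2D(filters, 3, padding="same", activation="relu")(e)
    # Decoder: symmetrically restore, ending with the input's channel count.
    d = e
    for filters in (32, 64, int(x.shape[-1])):
        d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    # Keep the original input characteristics by adding the input back.
    return layers.Add()([x, d])
```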

Up/down-sampling. As shown in Figure 6, the input size is doubled by bilinear up-sampling in the first two consecutive layers and then restored by average pooling in the last two layers. Finally, to maintain the features present before up/down-sampling, the input features are added to the output of the up/down-sampling path.

Figure 6: Up/down-sampling architecture
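A minimal Keras sketch, assuming an even spatial size so that the two 2x bilinear up-sampling steps are exactly undone by the two average-pooling steps:

```python
from tensorflow.keras import layers

def up_down_block(x):
    # Expand the global information by bilinear up-sampling (x2, twice).
    u = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    u = layers.UpSampling2D(size=2, interpolation="bilinear")(u)
    # Restore the original spatial size by average pooling (x1/2, twice).
    d = layers.AveragePooling2D(pool_size=2)(u)
    d = layers.AveragePooling2D(pool_size=2)(d)
    # Add the original features back, as in the autoencoder block.
    return layers.Add()([x, d])
```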

Bottleneck. It is a three-layer convolutional module with kernels of size 1×1, 3×3, and 1×1 (see Section 3.2). The architecture of the bottleneck layers is shown in Figure 7. The numbers of filters for the three layers are 1024, 512, and 256.

Figure 7: Bottleneck architecture
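A minimal Keras sketch of the bottleneck block with the filter counts listed above (1x1, 3x3, 1x1 kernels; the ReLU activations are an assumption):

```python
from tensorflow.keras import layers

def bottleneck_block(x):
    # 3x3 convolution squeezed between two 1x1 convolutions.
    x = layers.Conv2D(1024, 1, padding="same", activation="relu")(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 1, padding="same", activation="relu")(x)
    return x
```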

Noise.

For this layer, the GaussianNoise layer from the Keras library is used with a small standard deviation of 0.05.

Dense layers.

A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function.

Selective task/prediction. A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function. After that, as the original SelectiveNet implementation suggests, a layer that divides the result of the previous layer by 10 is used as a normalization step. Finally, a dense layer with one output and a sigmoid activation function is used. The hyper-parameters are set separately for MNIST and CIFAR10, with the coverage threshold set to 0.995 for MNIST and 0.9 for CIFAR10. More details about the SelectiveNet hyper-parameters are found in [18].
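A minimal Keras sketch of this selective head (the function name and the placement of the head on top of the classifier features are ours):

```python
from tensorflow.keras import layers

def selective_head(features, units=512):
    # Dense + BN + ReLU, the "divide by 10" Lambda normalization used in the
    # original SelectiveNet implementation, then a single sigmoid selection score g(x).
    s = layers.Dense(units)(features)
    s = layers.BatchNormalization()(s)
    s = layers.ReLU()(s)
    s = layers.Lambda(lambda t: t / 10.0)(s)
    return layers.Dense(1, activation="sigmoid")(s)
```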

Selective Knowledge Transfer block. It consists of one dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. The selective task of the knowledge transfer block consists of a dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. After that, a normalisation layer that divides the result of the previous layer by 10 is used, as recommended by the original implementation of SelectiveNet. Finally, a dense layer with one output and a sigmoid activation function is used. The coverage threshold is set to 0.7 for both MNIST and CIFAR10; more details about the SelectiveNet hyper-parameters are found in [18]. For ImageNet, the knowledge transfer module consists of two dense layers with 2048 and 4096 outputs, each followed by batch normalization and a ReLU activation function. The dense layer output of the selective module is set to 512.
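A minimal Keras sketch of the knowledge transfer classifier for the 10-class datasets, assuming the three classifiers' confidence vectors as inputs; letting the auxiliary head share the prediction branch is a simplification of ours:

```python
from tensorflow.keras import layers

def knowledge_transfer_classifier(conf_1, conf_2, conf_3, num_classes=10):
    # Concatenate the three classifiers' class confidences as the block input.
    z = layers.Concatenate()([conf_1, conf_2, conf_3])

    # Prediction branch: Dense(128) + BN + ReLU, then softmax over the classes.
    h = layers.Dense(128)(z)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    prediction = layers.Dense(num_classes, activation="softmax")(h)

    # Selective branch: Dense(128) + BN + ReLU, /10 normalization, sigmoid score.
    s = layers.Dense(128)(z)
    s = layers.BatchNormalization()(s)
    s = layers.ReLU()(s)
    s = layers.Lambda(lambda t: t / 10.0)(s)
    selective = layers.Dense(1, activation="sigmoid")(s)

    # Auxiliary head trained on the same prediction task (simplified here).
    auxiliary = layers.Dense(num_classes, activation="softmax")(h)
    return prediction, selective, auxiliary
```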

4.1.5 Threat Model and Attacks

We follow one of the threat models presented in [5, 8]: the zero-knowledge adversary threat model. It is assumed that the adversary does not know that a detector is deployed and generates the white-box attacks with knowledge of the baseline classifier only. For the cases where the adversary has perfect or limited knowledge of the detector, we assume that the adversary's task is much harder since SFAD adopts ensemble detection, and hence we leave this as future work. Instead, we test the detector's robustness with the recommended high-confidence strong attack, a variant of the CW attack, which is rarely tested against other detectors.

Attacks. We test the proposed model with different types of attacks. For white-box attacks, L∞-norm attacks are used: FGSM [19], PGD [45], and CW [10]. Besides, the L2-norm DeepFool (DF) [49] attack is used in the testing process as well. For the FGSM and PGD attacks, the epsilon is set to different values from 0.03 to 0.4. For the CW attack, the number of iterations is set to 1000 (strong case). The other attack-generation hyper-parameters are set to the defaults defined in the ART [53] library. The High Confidence Attack [8] is also used to test the robustness of the proposed model. For the black-box attacks, the Threshold Attack (TA) [29], Pixel Attack (PA) [68], and Spatial Transformation attack (ST) [14] are used in the testing process. The threshold parameter for TA and PA is set to 10 for MNIST and to the ART defaults for CIFAR. The translation and rotation values of the ST attack are set to 10 and 60 for MNIST and to 8 and 30 for CIFAR, respectively. For the comparison with state-of-the-art algorithms, more black-box attacks are considered, like the Square Attack (SA) [2] and the HopSkipJump attack (HSJA) [11]. For the SA attack, the epsilon is set from 0.1 to 0.4. For the HopSkipJump attack, the untargeted and unmasked attack is considered; besides, 40 iteration steps and 100 maximum evaluations are set, respectively.
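For illustration, the white-box attacks can be generated with ART roughly as follows; this is a sketch assuming a TensorFlow 2 baseline model and a recent ART version, and the variable names (baseline_model, x_test) are ours:

```python
import tensorflow as tf
from art.estimators.classification import TensorFlowV2Classifier
from art.attacks.evasion import (FastGradientMethod, ProjectedGradientDescent,
                                 CarliniLInfMethod, DeepFool)

# Wrap the trained baseline model (here: a CIFAR10-style classifier with 32x32x3 inputs).
classifier = TensorFlowV2Classifier(
    model=baseline_model, nb_classes=10, input_shape=(32, 32, 3),
    loss_object=tf.keras.losses.CategoricalCrossentropy(), clip_values=(0.0, 1.0))

attacks = {
    "fgsm_0.1": FastGradientMethod(classifier, eps=0.1),
    "pgd_0.1": ProjectedGradientDescent(classifier, eps=0.1),
    "cw_linf": CarliniLInfMethod(classifier, max_iter=1000),
    "deepfool": DeepFool(classifier),
}
adv_examples = {name: attack.generate(x=x_test) for name, attack in attacks.items()}
```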

4.1.6 Comparison with existing detectors

Supervised and unsupervised detectors are used to compare SFAD with state-of-the-art techniques. Supervised methods like LID [44] and RAID [15] are compared with SFAD, while unsupervised methods like FS [75], MagNet [47], NIC [43], and DNR [66] are also considered in the comparisons.

Dataset Row FGSM(0.05) FGSM(0.075) FGSM(0.1) FGSM(0.2) FGSM(0.4) FGSM(AVG) PGD(0.05) PGD(0.075) PGD(0.1) PGD(0.2) PGD(0.4) PGD(AVG) DF CW HCA
MNIST Baseline DNN 96.31 92.93 87.2 28.04 7.91 - 95.18 85.84 56.91 0 0 - 4.68 38.97 24.48
MNIST Selective Detection 18.82 25.02 33.08 82.29 98.4 51.52 20.49 32.91 51.81 58.07 48.17 42.29 95.7 43.71 99.58
MNIST Confidence Detection 8.66 13.74 20.75 74.18 98.66 43.2 10.31 21 42.73 54.14 47.41 35.12 94.96 42.78 98.59
MNIST Ensemble Detection 3.07 5.66 9.77 43.43 63.93 25.17 3.85 10.5 28.61 26.86 18.69 17.7 59.47 44.99 57.93
MNIST Prediction 76.99 68.64 58.74 9.03 0 42.68 74.34 57.72 31.69 0 0 32.75 0.19 37.44 0.12
MNIST Total 99.96 99.88 99.62 97.86 99.8 99.42 99.94 99.63 98.4 68.09 58.93 85 99.33 98.65 99.99
CIFAR Baseline DNN 14.09 13.44 12.25 10.5 9.75 - 0.43 0.28 0.22 0.16 0.17 - 4.79 20.95 26.33
CIFAR Selective Detection 41.31 46.8 53.78 71.34 9.83 44.61 33.6 33.18 30.52 22.66 18.65 27.72 39.06 37.41 19.48
CIFAR Confidence Detection 69.3 80.61 87.35 99.09 99.99 87.27 43.33 54.69 60.86 66.83 67.68 58.68 85.76 65.47 36.42
CIFAR Ensemble Detection 25.49 35.15 42.88 44.41 59.66 41.52 0 0 0.02 0.11 1.32 0.29 37.5 34.42 10.08
CIFAR Prediction 6.94 3.28 1.39 0.02 0 2.33 0.34 0.13 0.07 0.06 0.04 0.13 1.35 20.09 14.15
CIFAR Total 79.01 85.12 89.81 99.43 100 90.67 57.91 63.72 66.74 68.83 68.78 65.2 89.8 90.02 56.02
Table 5: Performance accuracy (%) of different detection processes against white-box attacks on MNIST and CIFAR10 datasets at FP=10%. The Prediction row reports the baseline DNN classification accuracy on the AEs that were not detected. Total accuracy = Detection + Prediction.
Attack/Model Ablation Study (NN) Proposed
Baseline DNN Selective Detection Confidence Detection Prediction Total Selective Detection Confidence Detection Prediction Total
FGSM(0.01) 13.96 48.99 4.38 17.63 71.01 63.2 10.41 11.01 74.2
FGSM(0.02) 9.82 53.14 4.14 13.37 70.65 67.1 11.24 8.52 75.62
FGSM(0.05) 8.28 58.46 4.26 9.11 71.83 73.25 13.37 7.1 80.36
FGSM(0.075) 8.64 59.76 5.09 7.93 72.78 76.09 13.14 6.27 82.37
FGSM(0.09) 8.52 59.17 4.97 7.93 72.07 77.4 13.73 5.8 83.2
FGSM(0.12) 8.4 57.28 5.44 7.81 70.53 81.3 12.43 4.73 86.04
PGD(0.01) 1.3 36.33 5.8 8.76 50.89 56.69 13.73 4.97 62.84
PGD(0.02) 0.71 23.67 7.93 4.85 36.45 59.05 16.21 2.6 62.49
PGD(0.05) 0.12 10.53 6.75 2.96 20.24 62.6 21.18 0.47 64.85
PGD(0.075) 0.24 7.69 5.8 2.6 16.09 62.49 23.31 0.83 65.44
PGD(0.09) 0.12 7.22 5.44 2.25 14.91 65.21 24.02 0.47 67.57
PGD(0.12) 0.12 5.56 3.67 2.13 11.36 65.09 23.08 0.36 67.46
DF 40.24 46.27 15.38 22.96 71.72 59.29 11 19.88 79.29
CW 9.7 58.7 11.36 4.85 63.67 58.7 11.36 16.45 75.27
Table 6: Performance accuracy (%) of different detection processes against white-box attacks on the ImageNet NIPS 2017 dataset at FP=10%. The Prediction column reports the baseline DNN classification accuracy on the AEs that were not detected. Total accuracy = Detection + Prediction.

4.2 Results and Discussion

In this section, we test a prototype of the SFAD technique against different types of attack scenarios and datasets, and we then provide a comparative discussion with state-of-the-art detectors. Performance results on successful attacks only are also discussed. Besides, the proposed approach is tested with different settings and the results are discussed. In order to emphasize the advantages of the SFAD module components, we provide an ablation study for each component. Finally, performance results for different rejection rates, i.e. false positive rates, are shown. As a reminder, we use only the last three representative layers (N = 3) to build three selective AEs classifiers, since the aim is to prove the concept of the approach; if this is changed to the best layer combination, the detector accuracy will be enhanced accordingly.

4.2.1 Zero-Knowledge (of detectors) adversary white-box attacks

Table 5 shows the selective, confidence, and ensemble detection accuracy of the SFAD prototype for the MNIST and CIFAR10 datasets. It also shows the baseline DNN prediction accuracy for the AEs in the "Baseline DNN" row and for the undetected AEs in the "Prediction" row. The "Total" row is the total accuracy of detected and correctly classified/predicted samples. In general, our approach achieves performance comparable to state-of-the-art methods. For the MNIST dataset, the FGSM attacks with small epsilon only slightly fool the baseline classifier, and hence their feature space remains inside or at the border of the training data distribution. The detector shows its ability to reject those samples that are very close to the class borders and achieves accuracies of 99.96%, 99.88%, and 99.62% for epsilon = 0.05, 0.075, and 0.1, respectively. The same applies to PGD attacks with small epsilon values. For larger epsilon values and for the DF, CW, and HCA attacks, the AEs are highly able to fool the baseline classifier, since adversaries are able to move the MNIST test samples' feature space outside the corresponding class borders; hence, for all tested attacks except PGD, the model was able to catch them with an accuracy above 98.65%, while the detector achieves 68.09% and 58.93% for PGD attacks with epsilon = 0.2 and 0.4, respectively. The feature space of some PGD examples becomes indistinguishable from that of the training samples, which makes the detector unable to catch them, because the best combination of representative layers is not used as input to the detector. For the CIFAR10 dataset, the model achieves results comparable to state-of-the-art methods for FGSM (at larger epsilon), DF, and CW attacks, while for FGSM (at small epsilon) and PGD attacks, the AEs have, to some extent, a feature space indistinguishable from the one the detector is trained with. On average, the model achieves an accuracy of 65.2% for PGD attacks. The HCA [8] is mainly generated to bypass detectors; our approach shows full robustness on MNIST and partial robustness on CIFAR10 against these attacks, while other detectors fail even though they achieve 100% accuracy against PGD attacks. For both datasets, the effectiveness of selective and confidence detection is obvious. The ability of the two modules to detect the AEs increases as the amount of perturbation increases. When the amount of perturbation increases in a way that makes the adversarial samples' feature space indistinguishable from the training dataset, the ability of these modules to detect the AEs decreases. Results show that ensemble detection adds little value (up to 3%) for all tested attacks except for PGD (epsilon = 0.2, 0.4) and CW on MNIST, where it adds 6% and 10% accuracy, respectively.

4.2.2 ImageNet NIPS’2017 results

In this section, we show the performance results of the proposed approach on the ImageNet subset defined in the adversarial attacks and defenses competition [32]. The performance results do not represent the real performance, since the dataset used in detector training is small and not sufficient to generalize to all ImageNet images; it is, however, sufficient to prove the concept of the proposed approach. The only change we made in training the detector for such a small dataset is that the selective AEs classifiers and the selective knowledge transfer classifier are trained end-to-end in one stage, as shown in Figure 2, instead of the two stages used for the MNIST and CIFAR-10 datasets.

Table 6 shows the performance evaluation of the proposed approach in the white-box scenario. As a reminder, the baseline classifier is a ResNet50V2 classifier. The epsilon values for FGSM and PGD are set to keep the added perturbations invisible to humans. As for the other datasets, the same arguments apply to this experiment: the selective and confidence modules show their ability to detect AEs, with average detection rates of 80.3%, 65.11%, 79.29%, and 75.27% for the FGSM, PGD, DF, and CW attacks, respectively. When all the proposed processing blocks are removed from the model, the performance decreases significantly. More investigation is needed to select the best combination of last layers for training the selective classifiers and enhancing the detector accuracy.

4.2.3 Black-box attacks

Table 7 shows the SFAD prototype accuracy against the TA [29], PA [68], and ST [14] attacks on the MNIST and CIFAR10 datasets. The detector is able to catch the AEs with very high accuracy, higher than 97.56% and 93.97% for MNIST and CIFAR10, respectively. It is clear that the selective, confidence, and ensemble modules complement each other. The black-box attacks significantly change the sample features, which facilitates the confidence module's detection process, while the ability of the selective module is limited for the TA and PA attacks, since these attacks change one or more pixels within a threshold that stays within a small variation of the input sample and yields AEs very close to clean samples.

Dataset Attack Baseline DNN Proposed
Selective Detection Confidence Detection Ensemble Detection Prediction Total
MNIST Threshold Attack 77.61 24.36 85.48 42.37 14.31 99.93
Pixel Attack 74.57 24.65 85.57 42.88 14.18 99.94
Spatial Transformation 22.04 86.7 80.95 34.71 2.85 97.59
CIFAR Threshold Attack 11.29 12.69 92.14 37.11 1.35 93.97
Pixel Attack 11.35 12.48 92.4 37.02 1.39 94.16
Spatial Transformation 52.58 44.64 68.16 32.44 24.03 96.57
Table 7: Performance accuracy (%) of different detection processes against black-box attacks on MNIST and CIFAR10 datasets at FP=10%. The Prediction column reports the baseline DNN classification accuracy on the AEs that were not detected. Total accuracy = Detection + Prediction.
Attack () Model#1 Model#2 Model#3
Total Selective Detection Confidence Detection Ensemble Detection Prediction Total Selective Detection Confidence Detection Ensemble Detection Prediction Total
FGSM(0.05) 99.96 11.94 4.31 0.57 86.14 100 12.98 5.01 0.87 84.72 100
FGSM(0.075) 99.88 13.71 5.3 1.08 83.77 100 15.63 6.65 1.68 81.14 99.98
FGSM(0.1) 99.62 16.3 6.76 1.66 80.41 100 19.48 9.17 2.62 76.29 99.95
FGSM(0.2) 97.86 46.72 30.18 12.77 48.25 99.74 57.56 47.39 24.75 35.59 98.91
FGSM(0.4) 99.8 96.84 96.39 58.97 0.41 99.82 95.88 98.81 55.88 0 99.89
PGD(0.05) 99.94 11.63 4.11 0.53 86.43 100 13.16 5.05 0.95 84.49 99.98
PGD(0.075) 99.63 12.9 4.67 0.92 84.99 100 16.34 7 1.74 80.21 99.98
PGD(0.1) 98.4 15.41 5.99 1.64 81.47 100 20.92 10.13 3.55 74.21 99.94
PGD(0.2) 68.09 47.2 32.61 15.18 44.44 99.34 60.85 54.55 30.25 23.85 97.11
PGD(0.4) 58.93 83.59 83.73 30.33 0.93 92.03 77.18 57.9 21.84 0.2 82.98
DF 99.33 90.33 84.72 43.46 7.37 99.62 91.78 91.26 57.84 2.59 98.62
CW 98.65 21.01 10.96 3.24 74.82 99.98 31.27 19.26 6.93 61.33 99.9
Table 8: Performance accuracy (%) of different detection processes against gray-box attacks on MNIST dataset at FP=10%. The Prediction column reports the baseline DNN classification accuracy on the AEs that were not detected. Total accuracy = Detection + Prediction.
Attack () Model#1 Model#2 Model#3
Total Selective Detection Confidence Detection Ensemble Detection Prediction Total Selective Detection Confidence Detection Ensemble Detection Prediction Total
FGSM(0.05) 79.01 43.77 72.69 28.27 8.56 82.65 48.36 73.44 27.45 7.9 84.33
FGSM(0.075) 85.12 45.61 79 33.18 4.26 83.94 49.16 78.54 32.32 4.17 84.42
FGSM(0.1) 89.81 53.11 87.11 38.82 2.03 89.51 48.36 82.7 32.8 2.55 86.24
FGSM(0.2) 99.43 77.37 99.58 37.72 0.03 99.69 60.85 96.82 26.97 0.26 97.16
FGSM(0.4) 100 5.6 100 62.05 0 100 23.04 99.99 16.8 0 100
PGD(0.05) 57.91 6.55 18.95 3.46 5.62 26.32 7.25 16.13 4.35 4.44 23
PGD(0.075) 63.72 2.58 12.43 1.12 5.13 18.31 3.33 8.93 1.96 4.31 14.67
PGD(0.1) 66.74 1.93 11.95 0.91 5.07 17.67 2.48 8.14 1.96 4.3 13.9
PGD(0.2) 68.83 3.5 18.74 2.05 4.9 24.54 6.13 16.45 4.05 3.92 23.27
PGD(0.4) 68.78 8.19 29.39 5.57 4.35 36.11 23.73 48.56 13.81 2.37 56.58
DF 89.8 37.27 77.65 42.68 16.17 97.37 29.3 81.02 37.75 11.74 96.02
CW 90.02 24.86 40.69 11.3 53.29 98.64 31.7 50.07 14.82 42.18 97.24
Table 9: Performance accuracy (%) of different detection processes against gray-box attacks on CIFAR10 dataset at FP=10%. The Prediction column reports the baseline DNN classification accuracy on the AEs that were not detected. Total accuracy = Detection + Prediction.

4.2.4 Gray-box attacks

The gray-box scenario assumes that the adversary knows only the model training data and the output of the DNN model and does not know the model architecture. Hence, we trained two substitute models, named Model#2 and Model#3, for MNIST and CIFAR-10, as shown in Table 3 and Table 4, respectively. Then, white-box AEs are generated using the substitute models, and the SFAD prototype is tested against these AEs. For both datasets, Table 8 and Table 9 show that the perturbation properties generated from one model transfer when tested on different models. For MNIST (see Table 8), the detector prediction rate is much better for PGD attacks, and the prediction rate for the other attacks is comparable with the prediction rate for AEs generated from Model#1. For CIFAR-10 (see Table 9), the prediction rate for the CW and DF attacks is higher than for those attacks generated from Model#1, while the prediction rate for FGSM is comparable with the prediction rate for FGSM attacks generated from Model#1. Unlike the other attacks, the transferability of the PGD attacks seems to be much stronger; hence, AEs with a feature space different from that of the AEs generated from Model#1 are produced. This reduces the ability of the detector to catch such attacks.

White-box attacks Black-box attacks
FGSM (0.05) FGSM (0.075) FGSM (0.1) FGSM (0.2) FGSM (0.4) FGSM (AVG) PGD (0.05) PGD (0.075) PGD (0.1) PGD (0.2) PGD (0.4) PGD (AVG) DF CW HCA TA PA ST
MNIST Selective Detection 59.89 60.74 64.56 89.4 98.33 74.58 57.35 64.23 68.05 58.07 48.17 59.17 95.78 69.16 99.89 25.96 28.04 87.9
Confidence Detection 60.99 60.6 63.29 88.56 98.57 74.4 59.45 63.66 71.16 54.14 47.41 59.16 95 69.96 99.61 99.19 99.08 88.1
Ensemble Detection 77.2 74.21 71.28 59.03 62.95 68.93 74.16 70.17 64.25 26.86 18.69 50.83 58.87 73.73 67.84 86.84 86.82 40.96
Total 98.9 98.42 97.07 97.04 99.78 98.24 98.74 97.35 96.31 68.09 58.93 83.88 99.3 97.8 99.99 99.73 99.8 96.91
CIFAR Selective Detection 44.7 49.05 54.7 70.16 10.2 45.76 33.74 33.25 30.6 22.68 18.68 27.79 39.12 47.32 19.22 10.92 10.84 60.08
Confidence Detection 72.57 81.45 87.35 99.03 99.99 88.08 43.44 54.69 60.84 66.83 67.68 58.7 86.53 81.74 34.26 92.78 93.05 87.96
Ensemble Detection 27.62 36.19 42.58 42.67 58.52 41.52 0 0 0.02 0.11 1.33 0.29 37.79 43.54 10.27 39.15 38.92 54.35
Total 75.57 82.82 88.4 99.36 100 89.23 57.74 63.62 66.66 68.71 68.74 65.09 89.33 87.38 40.32 93.21 93.42 92.76
Table 10: Detection accuracies (%) against successful white-box and black-box attacks on MNIST and CIFAR10 datasets at FP=10%. Total accuracy = the combined detection of the three detection modules.

4.2.5 Performance on successful attacks only

Table 10 shows the detection performance only for those AEs that fool the baseline DNN classifier, for the white-box and black-box attack scenarios. For MNIST, in general, results comparable to the state-of-the-art detectors are achieved for all tested white-box and black-box attacks, except for the PGD attacks (83.88%). The robustness against the HCA [8] is full on the MNIST dataset and partial on the CIFAR10 dataset. For both datasets, the impact of selective and confidence detection is obvious. The ability of these modules to detect the AEs increases as the amount of perturbation increases. When the amount of perturbation increases in a way that makes the adversarial samples' feature space indistinguishable from the training dataset, the ability of these modules to detect the AEs decreases. Ensemble detection plays a complementary role in the detection process for AEs with small perturbations. Once the amount of crafted added noise becomes high, the performance of ensemble detection decreases. That is because the detector classifiers and the baseline DNN classifier are trained for the same prediction task, and their behavior on highly degraded AEs is similar.

Figure 8: Total model performance accuracy (%) for black-box and white-box scenarios on MNIST and CIFAR-10 datasets at FP=10% with different selective AEs classifier settings.

4.2.6 Results with last layer(s) output(s)

The results shown in Figure 8 support the conclusion in [66] that recommends using more than one of the last layers of the baseline DNN classifier in detection techniques. For the MNIST dataset, the benefit of using more than one layer appears in detecting the PGD, TA, and PA attacks, while for the CIFAR-10 dataset it appears for all tested attacks. This means that low- and medium-level layers hold features that are triggered when small perturbations are added to the input samples.

Attack/Model Baseline DNN NN Only noise Only auto encoder Only up/down sampling Only bottleneck No noise No auto encoder No up/down sampling No bottleneck Proposed
FGSM(0.05) 96.31 100 99.97 100 100 99.92 99.95 99.96 99.95 99.97 99.96
FGSM(0.075) 92.93 99.98 99.94 99.96 99.97 99.81 99.76 99.9 99.8 99.92 99.88
FGSM(0.1) 87.2 99.9 99.93 99.9 99.93 99.69 99.49 99.79 99.51 99.86 99.62
FGSM(0.2) 28.04 99.61 99.44 98.97 98.78 96.71 96.37 96.96 97.79 98.79 97.86
FGSM(0.4) 7.91 98.98 98.37 98.65 96.14 97.85 93.73 99.87 98.81 97.01 99.8
PGD(0.05) 95.18 100 99.95 99.97 99.98 99.88 99.87 99.93 99.91 99.95 99.94
PGD(0.075) 85.84 99.83 99.86 99.84 99.86 99.58 99.43 99.73 99.57 99.76 99.63
PGD(0.1) 56.91 99.35 99.2 99.06 99.09 98.09 97.65 98.43 97.7 98.92 98.4
PGD(0.2) 0 60.06 59.33 66.24 53.73 66.18 57.9 61.39 63.44 63.93 68.09
PGD(0.4) 0 43.8 42.61 52.57 40.97 46.11 45.11 43.39 46.57 47.71 58.93
DF 4.68 97.71 97.35 98.44 95.15 98.51 96.88 99 98.83 98.62 99.33
CW 38.97 99.71 99.66 99.52 99.6 98.36 97.73 98.79 98.57 99.19 98.65
HCA 24.48 100 100 100 100 99.3 99.74 99.56 100 100 99.99
Table 11: Ablation performance (%) on white-box scenarios for MNIST dataset.
Attack/Model Baseline DNN NN Only noise Only auto encoder Only up/down sampling Only bottleneck No noise No auto encoder No up/down sampling No bottleneck Proposed
FGSM(0.05) 14.09 80.02 79.55 73 80.01 79.68 78.62 75.81 75.7 78.55 79.01
FGSM(0.075) 13.44 87.9 85.21 80.44 85.84 86.61 86.2 84.77 81.3 85 85.12
FGSM(0.1) 12.25 91.37 87.26 84.64 87.71 91 91.27 90.75 85.56 88.26 89.81
FGSM(0.2) 10.5 85.52 78.89 95.13 78.73 97.79 99.48 98.19 95.69 88.92 99.43
FGSM(0.4) 9.75 84.39 84.94 99.71 79.32 100 100 99.98 100 99.85 100
PGD(0.05) 0.43 0.42 0.43 4.21 0.42 1.08 36.61 7.8 48.34 12.58 57.91
PGD(0.075) 0.28 0.27 0.28 11.64 0.27 2.76 47.19 9.65 58.32 12.67 63.72
PGD(0.1) 0.22 0.22 0.22 17.72 0.22 5.22 52.68 12.09 61.49 13.42 66.74
PGD(0.2) 0.16 0.15 0.15 27.83 0.16 14.7 58.23 18.86 64.31 16.61 68.83
PGD(0.4) 0.17 0.17 0.17 33.74 0.16 22.75 60.23 23.9 65.4 18.21 68.78
DF 4.79 83.67 84.84 83.71 82.7 89.11 88.32 87.64 87.96 88.09 89.8
CW 20.95 88.98 89.05 87.03 88.33 89.46 88.63 87.7 88.14 89.26 90.02
HCA 26.33 51.99 51.35 47.58 50.65 52.9 63.09 56.83 52.09 53.09 56.02
Table 12: Ablation performance (%) on white-box scenarios for CIFAR-10 dataset.

4.2.7 Ablation study

In this section, we emphasize the advantages of the proposed approach's components, including the noise, autoencoder, up/down-sampling, and bottleneck blocks. Table 11 and Table 12 show the performance results for each block, once when it is present alone and once when it is absent, for the MNIST and CIFAR-10 datasets. In all settings, the selective task and knowledge transfer modules are present, since this ablation study aims at identifying the added value of the feature-processing blocks.

Only NN. When all processing blocks are absent, the MNIST results show the ability to detect FGSM, PGD with small epsilon values, and CW attacks slightly better than the proposed approach, while the proposed approach yields better results for DF and PGD with high epsilon values. Since the CIFAR-10 dataset is different from MNIST and has different characteristics, the NN-only variant does not yield better results for FGSM with high epsilon values, PGD, CW, and DF attacks.

Noise. When only the noise block is used, the model achieves results comparable to SFAD except for the PGD attacks. When we remove the noise block, the performance of SFAD is reduced, especially for the PGD attacks on the MNIST and CIFAR-10 datasets. The noise block helps the detector better distinguish the feature space of a clean input image from the features of AEs.

Autoencoder. The autoencoder block shows a substantial effect in the proposed approach. As discussed in Section 3.2, if the autoencoder cannot reconstruct its input, a different feature space may be generated for the input signal, which lets SFAD detect the AEs. For the MNIST dataset, the autoencoder block enhances the performance compared to the NN-only model for PGD with higher epsilon values, and the performance is reduced when the autoencoder block is removed from the proposed approach. On the other hand, for the CIFAR-10 dataset, when only the autoencoder is present, the performance is much better for FGSM with high epsilon values, PGD, CW, and DF attacks compared to the NN-only model, and the performance for PGD attacks is reduced when it is removed from the proposed approach.

Up/down-sampling. Unlike the other processing blocks, the up/down-sampling block yields lower performance for FGSM attacks and comparable results for the other attacks compared to the NN-only model. That is because the up/down-sampling restores the global information of the input signal through the average pooling process. On the other hand, removing the sampling block from the proposed approach reduces the performance, especially for the CIFAR-10 dataset.

Bottleneck. Like the autoencoder block, the bottleneck block shows its ability to distinguish input signal characteristics, especially in the proposed shallow classifiers (the selective AEs classifiers). Compared to the NN-only model, the bottleneck-only model enhances the performance for FGSM with high epsilon values, PGD, CW, and DF attacks on the CIFAR-10 dataset and for PGD with high epsilon values on MNIST. Besides, the performance of the proposed approach significantly decreases for the CIFAR-10 dataset when the bottleneck block is removed.

Figure 9: Performance comparison of SFAD at different false positive (FP) rates versus FP=10% for white-box attacks on the MNIST dataset.

4.2.8 Performance with different rejection rates (false positive (FP) rates)

In this subsection we report the performance of the proposed approach when the thresholds are set to reject less than 10% of clean inputs for MNIST, as shown in Figure 9. The results show that acceptable performance can still be achieved with thresholds below 10%. For instance, when the false positive rate is set to 2%, the results for PGD attacks (at certain perturbation strengths) decrease significantly, since the selective detection contributes most of the gain at higher FP rates, whereas the confidence threshold shows only a small decrease in performance. For all other tested attacks, the difference is at most 4% and 1.76% when FP=2% and FP=3%, respectively.
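To make the relation between rejection budget and thresholds concrete, the following sketch shows one way such thresholds could be calibrated on clean validation data so that roughly a target fraction of clean inputs is rejected. The score convention (lower score = more suspicious), the `threshold_for_fp_rate` helper, and the synthetic score distribution are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def threshold_for_fp_rate(clean_scores, target_fp=0.10):
    """Pick the threshold that rejects roughly `target_fp` of clean inputs.

    `clean_scores` are detector scores on clean validation inputs, with lower
    scores being more suspicious; inputs scoring below the threshold are rejected.
    """
    return np.percentile(clean_scores, 100.0 * target_fp)

# Hypothetical calibration: moving from FP=10% to FP=2% lowers the threshold,
# so fewer clean inputs are rejected at the cost of missing some AEs.
rng = np.random.default_rng(0)
clean_scores = rng.beta(8, 2, size=5000)   # stand-in for clean validation scores
thr_10 = threshold_for_fp_rate(clean_scores, 0.10)
thr_02 = threshold_for_fp_rate(clean_scores, 0.02)
reject = lambda score: score < thr_02      # flag as adversarial / reject
```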

Dataset   Detector      FPR     White box                                    Black box
                                FGSM     PGD      CW       HCA      DF       SA       HSJA     STA
MNIST     LID           0.81    77.46    77.03    64.43    84.51    93.3     42.78    61.52    93.81
          FS            5.27    97.96    97.19    98.41    99.99    66.96    99.96    99.98    77.49
          MagNet        0.20    100.00   100.00   40.56    100.00   96.99    99.93    98.32    1.61
          NIC           10.12   100.00   100.00   100      100.00   100      99.68    100      99.83
          DNR           10.01   79.67    59.21    57.98    89.90    95.6     81.27    59.98    88
          SFAD (ours)   10.79   98.66    81.83    98.24    99.36    99.58    98.85    99.91    97.61
CIFAR10   LID           7.30    76.15    96.81    64.52    55.03    63.57    85.76    88.34    94.23
          FS            5.07    32.50    4.20     56.18    56.31    39.18    17.82    84.16    22.46
          MagNet        0.77    34.61    0.62     13.23    0.53     57.33    94.04    0.58     2.32
          NIC           10.08   63.15    100.00   61.68    73.29    84.91    61.88    67.53    48.77
          DNR           10.01   30.23    18.23    44.15    29.85    30.2     52.86    38.81    56.2
          SFAD (ours)   10.90   80.14    41.20    87.68    45.85    89.57    93.91    95.57    92.9
Table 13: Comparison with the state-of-the-art detectors (detection rate, %) against white-box and black-box attacks. Detectors include both supervised and unsupervised methods. Top 3 are colored with red, blue, and green respectively.

4.2.9 Comparisons with state-of-the-art detectors

In this subsection we compare SFAD with different types of supervised and unsupervised detectors using the detectors benchmark (the source code is available at https://github.com/aldahdooh/detectors_review); the results are shown in Table 13. We compare using the average FGSM and PGD results; for fair comparison, the perturbation strengths are set to 0.125, 0.25, and 0.3125 for MNIST and to 0.03 and 0.06 for CIFAR10. As discussed in Section 4.2.8, the rejection/false positive rate of SFAD can be decreased with a small compromise in performance. For the RAID [15] method, we compare with the results reported in the original paper, as its code is not publicly available.
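For reference, the following is a minimal sketch of the kind of comparison metric behind Table 13: fix each detector's threshold so that a chosen fraction of clean test inputs is flagged (the FPR column), then report the fraction of adversarial inputs that is flagged. The function name and the score convention (higher score = more adversarial) are assumptions for illustration, not the benchmark's actual code.

```python
import numpy as np

def detection_rate_at_fpr(clean_scores, adv_scores, fpr=0.10):
    """Detection rate when the threshold flags roughly `fpr` of clean inputs."""
    thr = np.quantile(clean_scores, 1.0 - fpr)   # scores above thr are flagged
    return float(np.mean(adv_scores > thr))
```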

LID [44]. SFAD outperforms LID on both datasets and all tested attacks except PGD on CIFAR10. LID achieves a better false positive rate than SFAD, but it fails against the High Confidence Attack, as reported in [8]. When LID is trained with HCA, as in our experiment, it achieves better results than in [8]. Our approach provides full and partial robustness against HCA for MNIST and CIFAR10, respectively. In fact, LID needs clean, noisy, and adversarial images to accurately train the detector to identify the boundaries between clean and adversarial inputs.

RAID [15]. For MNIST, RAID achieves a higher detection rate for PGD attacks (at certain perturbation strengths), and for CIFAR10 it achieves higher detection rates for FGSM and PGD attacks, while our approach improves the results for CW and DF attacks. Besides, RAID has a better false positive rate for MNIST only. RAID is trained on clean and adversarial inputs to identify differences in neuron activations between clean and adversarial samples. Hence, it requires extensive knowledge of attacks and their variants for training.

FS [75]. As stated in [75], FS requires high-quality squeezers for different baseline networks, and it does not perform well against the tested attacks on the CIFAR10 dataset, while our approach generalizes better than FS at the expense of a higher false positive rate.

MagNet [47]. The results reported in Table 13 are for the detection part of MagNet only, without its defense part. For MNIST, comparable results are achieved by our approach except for CW and STA attacks, while for CIFAR10 our approach outperforms MagNet against the tested attacks. Since MagNet is a denoiser-based detector, it is not guaranteed that its denoisers will remove all the noise and produce sufficiently denoised inputs that respect the target threshold; this applies in particular to certain norm-bounded attacks. In contrast, our approach relies on the confidence changes that AEs cause, which makes it able to identify AEs. Although MagNet yields a lower false positive rate, it has been shown that MagNet can be broken by different strategies [9].

NIC [43]. NIC is the state-of-the-art detector and, in general, achieves better performance against white-box attacks than the other detectors, while our approach achieves better performance against the tested black-box attacks. Unlike the proposed approach, NIC's baseline detectors have been reported to be inconsistent [7], to increase the model parameter overhead [42], to be time consuming [43], and to add latency at inference time [17].

DNR [66]. This work is close to our approach, but ours additionally includes the feature-processing and selective-prediction modules. The reported results show that our approach outperforms DNR at the same false positive rates for the MNIST and CIFAR10 datasets.

Other performance comparisons: SFAD has moderate complexity due to the training time of its classifiers and introduces no inference-time latency, but it incurs an overhead because the classifiers' parameters must be stored. Relative to other detectors, SFAD uses shallow networks and hence has much lower complexity than NIC, DNR, and LID. Besides, it works in parallel with the baseline classifier and adds no latency, in contrast to FS and NIC. Finally, like NIC and DNR, SFAD pays a small price in overhead compared to MagNet, FS, and LID.

5 Conclusion

In this work, we have proposed a novel mechanism, namely SFAD, to detect adversarial attacks. SFAD processes the outputs of the last layers of the baseline DNN classifier to identify anomalous inputs. It builds selective AE classifiers, each taking one layer output of the baseline classifier as input and processing it using autoencoder, up/down-sampling, bottleneck, and additive-noise blocks. These feature-based classifiers are optimized for selective prediction. Their confidence values are then distilled as input to a knowledge-transfer classifier that is also optimized with selective prediction. Selective and confidence thresholds are set to identify adversarial inputs, and the selective, confidence, and ensemble modules work jointly to enhance the detection accuracy. We showed that the model is consistent and able to detect the tested attacks. Moreover, the model is robust in different attack scenarios: white-, black-, and gray-box attacks. This robustness, together with the fact that the model does not require any knowledge of adversarial attacks, should lead to better generalization. The main limitation of the model is that a suitable combination of thresholds needs to be identified to enhance the detection accuracy and reduce false positive rates.
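As a summary of the detection logic described above, the sketch below flags an input when the knowledge-transfer classifier abstains (selective score below its threshold), when its softmax confidence is too low, or when the ensemble of shallow classifiers disagrees with the baseline prediction. The threshold values and the exact combination rule are illustrative assumptions, not the authors' exact decision procedure.

```python
import numpy as np

def sfad_like_decision(select_score, class_probs, shallow_preds, baseline_pred,
                       t_select=0.5, t_conf=0.9):
    """Return True if the input should be rejected as a likely adversarial example."""
    abstain        = select_score < t_select                 # selective (rejection) head
    low_confidence = float(np.max(class_probs)) < t_conf     # confidence threshold
    ensemble_vote  = int(np.bincount(np.asarray(shallow_preds)).argmax())
    disagreement   = ensemble_vote != baseline_pred          # ensemble vs. baseline
    return abstain or low_confidence or disagreement
```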

Acknowledgement

The project is funded by both the Région Bretagne (Brittany region, France) and the Direction Générale de l'Armement (DGA).

References

  • [1] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §1.
  • [2] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2020) Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, pp. 484–501. Cited by: §4.1.5.
  • [3] A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: 4th item, §1, §2.1.
  • [4] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1563–1572. Cited by: §3.1.
  • [5] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §4.1.5.
  • [6] T. Borkar, F. Heide, and L. Karam (2020) Defending against universal attacks through selective feature regeneration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 709–719. Cited by: §1.
  • [7] S. Bulusu, B. Kailkhura, B. Li, P. K. Varshney, and D. Song (2020) Anomalous instance detection in deep learning: a survey. arXiv preprint arXiv:2003.06979. Cited by: §1, §2.1, §2.1, §4.2.9.
  • [8] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: 4th item, §1, §2.1, §4.1.5, §4.1.5, §4.2.1, §4.2.5, §4.2.9.
  • [9] N. Carlini and D. Wagner (2017) MagNet and “Efficient defenses against adversarial attacks” are not robust to adversarial examples. arXiv preprint arXiv:1711.08478. Cited by: §1, §2.1, §4.2.9.
  • [10] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §1, §4.1.5.
  • [11] J. Chen, M. I. Jordan, and M. J. Wainwright (2020) Hopskipjumpattack: a query-efficient decision-based attack. In 2020 ieee symposium on security and privacy (sp), pp. 1277–1294. Cited by: §4.1.5.
  • [12] P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §1.
  • [13] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2017) Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257. Cited by: §3.2.
  • [14] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2019) Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pp. 1802–1811. Cited by: §1, §4.1.5, §4.2.3.
  • [15] H. F. Eniser, M. Christakis, and V. Wüstholz (2020) RAID: randomized adversarial-input detection for neural networks. arXiv preprint arXiv:2002.02776. Cited by: §1, §2.1, §4.1.6, §4.2.9, §4.2.9.
  • [16] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §1, §2.1.
  • [17] Y. Gao, B. G. Doan, Z. Zhang, S. Ma, A. Fu, S. Nepal, and H. Kim (2020) Backdoor attacks and countermeasures on deep learning: a comprehensive review. arXiv preprint arXiv:2007.10760. Cited by: §1, §2.1, §4.2.9.
  • [18] Y. Geifman and R. El-Yaniv (2019) SelectiveNet: A deep neural network with an integrated reject option. CoRR abs/1901.09192. External Links: Link, 1901.09192 Cited by: §1, §1, §2.2, §3.1, §3.2, §3, §4.1.4, §4.1.4.
  • [19] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §4.1.5.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2.
  • [21] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §2.1.
  • [22] S. Gu and L. Rigazio (2014) Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068. Cited by: §1.
  • [23] M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §3.2.
  • [24] W. Guo, D. Mu, J. Xu, P. Su, G. Wang, and X. Xing (2018) Lemna: explaining deep learning based security applications. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 364–379. Cited by: §1.
  • [25] H. X. Y. M. Hao-Chen, L. D. Deb, H. L. J. T. Anil, and K. Jain (2020) Adversarial attacks and defenses in images, graphs and text: a review. International Journal of Automation and Computing 17 (2), pp. 151–178. Cited by: §1.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §3.2, §4.1.2.
  • [27] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136. Cited by: §1.
  • [28] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491. Cited by: §3.2.
  • [29] S. Kotyan and D. Vasconcellos Vargas (2019) Adversarial robustness assessment: why both L0 and L∞ attacks are necessary. arXiv e-prints, pp. arXiv–1906. Cited by: §1, §4.1.5, §4.2.3.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [31] A. Krizhevsky (2012-05) Learning multiple layers of features from tiny images. University of Toronto, pp. . Cited by: §1, §4.1.1.
  • [32] A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang, T. Pang, J. Zhu, X. Hu, C. Xie, et al. (2018) Adversarial attacks and defences competition. In The NIPS’17 Competition: Building Intelligent Systems, pp. 195–231. Cited by: §1, §4.1.1, §4.2.2.
  • [33] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1.
  • [34] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §4.1.1.
  • [36] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656–672. Cited by: §1, §3.2.
  • [37] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1787. Cited by: §1, §2.1.
  • [38] S. Liu, E. Johns, and A. J. Davison (2019) End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1871–1880. Cited by: §1, §3.2, §3.2.
  • [39] X. Liu, M. Cheng, H. Zhang, and C. Hsieh (2018) Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 369–385. Cited by: §1, §3.2.
  • [40] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C. Hsieh (2020) How does noise help robustness? explanation and exploration under the neural sde framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 282–290. Cited by: §1, §3.2.
  • [41] J. Lu, T. Issaranon, and D. Forsyth (2017) Safetynet: detecting and rejecting adversarial examples robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 446–454. Cited by: §1, §1, §2.1.
  • [42] J. Lust and A. P. Condurache (2020) GraN: an efficient gradient-norm based detector for adversarial and misclassified examples. arXiv preprint arXiv:2004.09179. Cited by: §1, §2.1, §4.2.9.
  • [43] S. Ma and Y. Liu (2019) Nic: detecting adversarial samples with neural network invariant checking. In Proceedings of the 26th Network and Distributed System Security Symposium (NDSS 2019), Cited by: §1, §2.1, §4.1.6, §4.2.9.
  • [44] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613. Cited by: 4th item, §1, §2.1, §4.1.6, §4.2.9.
  • [45] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §1, §4.1.5.
  • [46] M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli (2017) Is deep learning safe for robot vision? adversarial examples against the icub humanoid. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 751–759. Cited by: §1, §3.1, §3.2.
  • [47] D. Meng and H. Chen (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp. 135–147. Cited by: §1, §2.1, §3.2, §3.2, §4.1.6, §4.2.9.
  • [48] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267. Cited by: §2.1.
  • [49] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1, §4.1.5.
  • [50] A. Mustafa, S. H. Khan, M. Hayat, J. Shen, and L. Shao (2019) Image super-resolution as a defense against adversarial attacks. IEEE Transactions on Image Processing 29, pp. 1711–1724. Cited by: §1.
  • [51] A. Nayebi and S. Ganguli (2017) Biologically inspired protection of deep networks from adversarial attacks. arXiv preprint arXiv:1703.09202. Cited by: §1.
  • [52] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §2.1.
  • [53] M. Nicolae, M. Sinn, M. N. Tran, B. Buesser, A. Rawat, M. Wistuba, V. Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. Molloy, and B. Edwards (2018) Adversarial robustness toolbox v1.2.0. CoRR 1807.01069. External Links: Link Cited by: §4.1.5.
  • [54] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1.
  • [55] N. Papernot, P. McDaniel, and I. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277. Cited by: §1.
  • [56] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §1.
  • [57] F. A. Potra and S. J. Wright (2000) Interior-point methods. Journal of Computational and Applied Mathematics 124 (1-2), pp. 281–302. Cited by: §3.1.
  • [58] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer (2018) Deflecting adversarial attacks with pixel deflection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8571–8580. Cited by: §1.
  • [59] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [60] S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §3.1, §3.2.
  • [61] M. Sakurada and T. Yairi (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4–11. Cited by: §2.2.
  • [62] O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527–538. Cited by: §3.2.
  • [63] D. Shen, G. Wu, and H. Suk (2017) Deep learning in medical image analysis. Annual review of biomedical engineering 19, pp. 221–248. Cited by: §1.
  • [64] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [65] L. Smith and Y. Gal (2018) Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533. Cited by: §1.
  • [66] A. Sotgiu, A. Demontis, M. Melis, B. Biggio, G. Fumera, X. Feng, and F. Roli (2020) Deep neural rejection against adversarial examples. EURASIP Journal on Information Security 2020, pp. 1–10. Cited by: §1, §1, item 1, §3.1, §3.2, §4.1.6, §4.2.6, §4.2.9.
  • [67] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §1.
  • [68] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23 (5), pp. 828–841. Cited by: §1, §4.1.5, §4.2.3.
  • [69] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • [70] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1.
  • [71] S. Vandenhende, S. Georgoulis, M. Proesmans, D. Dai, and L. Van Gool (2020) Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379. Cited by: §3.1, §3.2.
  • [72] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1, §3.2.
  • [73] C. Xie, M. Tan, B. Gong, A. Yuille, and Q. V. Le (2020) Smooth adversarial training. arXiv preprint arXiv:2006.14536. Cited by: §1.
  • [74] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509. Cited by: §1.
  • [75] W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §1, §2.1, §3.2, §4.1.6, §4.2.9.
  • [76] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3713–3722. Cited by: §3.2.
  • [77] L. Zhang, Z. Tan, J. Song, J. Chen, C. Bao, and K. Ma (2019) SCAN: a scalable neural networks framework towards compact and efficient models. In Advances in Neural Information Processing Systems, pp. 4027–4036. Cited by: §3.2, §3.2.
  • [78] L. Zhang, M. Yu, T. Chen, Z. Shi, C. Bao, and K. Ma (2020) Auxiliary training: towards accurate and robust models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 372–381. Cited by: §3.2.
  • [79] C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674. Cited by: §2.2.
  • [80] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations. Cited by: §2.2.