1. Introduction
Despite achieving tremendous successes, deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples (Goodfellow et al., 2015; Szegedy et al., 2013). By adding imperceptible perturbations to the original inputs, an attacker can craft adversarial examples to fool a learned classifier. Adversarial examples are indistinguishable from the original input images to humans, but are misclassified by the classifier. With the wide application of machine learning models, this raises concerns about the safety of machine learning systems in security-sensitive areas, such as self-driving cars, flight control systems, healthcare systems and so on.
There has been extensive research on improving the robustness of deep neural networks against adversarial examples. In (Athalye et al., 2018), the authors showed that many defense methods (Dhillon et al., 2018; Ma et al., 2018; Samangouei et al., 2018; Song et al., 2017; Xie et al., 2017) can be circumvented by strong attacks, with the exception of Madry’s adversarial training (Madry et al., 2017), in which adversarial examples are generated during training and added back to the training set. Since then, adversarial-training-based algorithms have become the state-of-the-art methods for defending against adversarial examples. However, despite being able to improve robustness under strong attacks, adversarial-training-based algorithms are time-consuming due to the cost of generating adversarial examples on-the-fly. Improving the robustness of deep neural networks remains an open question.
Due to the difficulty of defense, recent work has turned to detecting adversarial examples as an alternative solution. The main assumption made by the detectors is that adversarial samples come from a distribution that is different from the natural data distribution; that is, adversarial samples do not lie on the data manifold, and DNNs perform correctly only near the manifold of training data (Tanay and Griffin, 2016). Much work has been done to study the characteristics of adversarial examples and leverage these characteristics to detect adversarial examples instead of trying to classify them correctly (Ma et al., 2018; Feinman et al., 2017; Zheng and Hong, 2018; Pang et al., 2018; Tao et al., 2018; Lee et al., 2018; Yang et al., 2020).
Although many algorithms have been proposed for adversarial detection, most of them are deterministic, meaning they can only use the information from a single forward pass to detect adversarial examples. This makes it easier for an attacker to break those models, especially when the attacker knows the neural network architecture and weights. In this paper, we propose a novel algorithm to detect adversarial examples based on randomized neural networks. Intuitively, incorporating randomness in neural networks can improve the smoothness of predictors, thus enabling stronger robustness guarantees (see randomization-based defense methods in (Liu et al., 2017; Xie et al., 2017; Cohen et al., 2019)). Further, instead of observing only one hidden feature for each layer, a randomized network yields a distribution of hidden features, making it easier to detect an out-of-manifold example. We propose a detection method based on Bayesian neural networks (BNNs), leveraging the randomness of BNN to improve detection performance (see framework in Figure 1). The proposed method, BATector, is motivated by the following observations: 1) the hidden layer outputs generated from adversarial examples demonstrate different characteristics from those generated from natural data, and this phenomenon is more obvious in BNN than in a deep neural network without random components; 2) the randomness of BNN makes it easier to simulate the characteristics of the hidden layer outputs. Training BNN is not very time-consuming, as it only doubles the number of parameters of a deep neural network with the same structure. Nevertheless, BNN can achieve comparable classification accuracy while improving the smoothness of the classifier.
In numerical experiments, our method achieves better performance in detecting adversarial examples generated by popular attack methods on MNIST, CIFAR-10 and ImageNet-sub than state-of-the-art detection methods. Ablation experiments show that BNN performs better than a neural network without random components using the same detection scheme. In addition, a customized attack method is developed to break the proposed detection method, and the results show that the proposed method achieves reasonable performance under both the customized strong attack and high-confidence attacks.
Notation
In this paper, all vectors are denoted by bold symbols. The input to the classifier is represented by $\mathbf{x}$ and the label associated with the input is represented by $y$. Thus, one observation is a pair $(\mathbf{x}, y)$. The classifier is denoted by $f$, and $f(\mathbf{x})$ represents the output vector of the classifier. The dimension of $f(\mathbf{x})$ equals the number of classes of the dataset, and $f(\mathbf{x})_i$ is the score of predicting $\mathbf{x}$ with label $i$. The prediction of the classifier is denoted by $C(\mathbf{x}) = \arg\max_i f(\mathbf{x})_i$; that is, the predicted label is the one with the highest prediction score. We use the $\ell_\infty$ and $\ell_2$ distortion metrics to measure similarity, reporting the $\ell_\infty$ distance in the normalized $[0, 1]$ space and the $\ell_2$ distance as the total root-mean-square distortion normalized by the total number of pixels.
2. Related Work
Adversarial attack
Multiple attack methods have been introduced for crafting adversarial examples to attack deep neural networks (Yuan et al., 2019; Athalye et al., 2018; Carlini and Wagner, 2017a, b; Moosavi-Dezfooli et al., 2017). Depending on the information available to the adversary, attack methods can be divided into white-box attacks and black-box attacks. Under the white-box setting, the adversary can analytically compute the model’s gradients/parameters and has full access to the model architecture. Most white-box attacks generate adversarial examples based on the gradient of the loss function with respect to the input (Moosavi-Dezfooli et al., 2016; Chen et al., 2018; Madry et al., 2017; Carlini and Wagner, 2017b). Among them, the FGSM, CW and PGD attacks have been widely used to test the robustness of machine learning models. In reality, detailed model information, such as the gradient, may not be available to the attacker. Some attack methods are more agnostic and rely only on the predicted labels or scores (Chen et al., 2017; Brendel et al., 2017; Ilyas et al., 2018; Cheng et al., 2019; Yan et al., 2019). (Chen et al., 2017) proposed a method to estimate the gradient based on the score information and craft adversarial examples with the estimated gradient.
(Brendel et al., 2017; Ilyas et al., 2018; Cheng et al., 2019; Yan et al., 2019; Chen et al., 2019) introduced methods that rely only on the final decision of the model.
Adversarial defense
To defend against adversarial examples, many studies have been done to improve the robustness of deep neural networks, including adversarial training (Madry et al., 2017; Kurakin et al., 2016; Tramèr et al., 2017; Zhang et al., 2019), generative models (Samangouei et al., 2018; Meng and Chen, 2017; Li et al., 2020; Jalal et al., 2017) and verifiable defense (Wong and Kolter, 2018). The authors of (Athalye et al., 2018) showed that many defense methods (Dhillon et al., 2018; Ma et al., 2018; Samangouei et al., 2018; Song et al., 2017; Xie et al., 2017) can be circumvented by strong attacks, with the exception of Madry’s adversarial training (Madry et al., 2017). Since then, adversarial-training-based algorithms have become the state-of-the-art methods for defending against adversarial examples. However, adversarial training is computationally expensive due to the cost of generating adversarial examples on-the-fly, so adversarial defense remains an open problem.
Adversarial detection
Another popular line of research focuses on screening out adversarial examples. A straightforward approach to adversarial example detection is to build a simple binary classifier separating adversarial from clean data based on the characteristics of adversarial examples (Gong et al., 2017; Feinman et al., 2017; Lee et al., 2018). In (Feinman et al., 2017), the authors proposed performing density estimation on the training data in the feature space of the last hidden layer to help detect adversarial examples (KD). The authors of (Ma et al., 2018) observed that the Local Intrinsic Dimensionality (LID) of hidden-layer outputs differs between original and adversarial examples, and leveraged this characteristic to detect adversarial examples. In (Lee et al., 2018), the authors fit class-conditional Gaussian distributions to the lower-level and upper-level features of the deep neural network under Gaussian discriminant analysis, resulting in a confidence score based on the Mahalanobis distance (MAHA), followed by a logistic regression model on the confidence scores to detect adversarial examples. In (Yang et al., 2020), the authors studied the feature attributions of adversarial examples and proposed a detection method based on feature attribution scores. The authors of (Roth et al., 2019) showed that adversarial examples exist in cone-like regions in very specific directions from their corresponding natural inputs and proposed a new test statistic to detect adversarial examples based on these findings. Recently, a joint statistical test pooling information from multiple layers was proposed in (Raghuram et al., 2020) to detect adversarial examples. Through extensive experiments, we show that our method achieves performance comparable or superior to these detection methods across various attacks.
Bayesian neural network
The idea of BNN is illustrated in Figure 2. In (Blundell et al., 2015), the authors introduced an efficient algorithm to learn the parameters of BNN. Given the observable random variables $(\mathbf{x}, y)$, BNN aims to estimate the distributions of the hidden variables $\mathbf{w}$, instead of estimating a maximum likelihood value for the weights. Since, from the Bayesian perspective, each parameter is now a random variable measuring the uncertainty of the estimation, the model can potentially extract more information to support a better prediction (in terms of precision, robustness, etc.).
Given the input $\mathbf{x}$ and label $y$, a BNN aims to estimate the posterior over the weights $p(\mathbf{w} \mid \mathbf{x}, y)$ given the prior $p(\mathbf{w})$. The true posterior can be approximated by a parametric distribution $q_{\boldsymbol{\theta}}(\mathbf{w})$, where the unknown parameter $\boldsymbol{\theta}$ is estimated by minimizing the KL divergence
(1)  $\mathrm{KL}\big[\, q_{\boldsymbol{\theta}}(\mathbf{w}) \,\big\|\, p(\mathbf{w} \mid \mathbf{x}, y) \,\big]$
over $\boldsymbol{\theta}$. For simplicity, $q_{\boldsymbol{\theta}}(\mathbf{w})$ is often assumed to be a fully factorized Gaussian distribution:
(2)  $q_{\boldsymbol{\theta}}(\mathbf{w}) = \prod_{i} \mathcal{N}\big(w_i \mid \mu_i, \sigma_i^2\big)$
where $\mu_i$ and $\sigma_i$ are the parameters of the Gaussian distribution of each weight. The objective function for training BNN is reformulated from expression (1) and is shown in expression (3), which is the sum of a data-dependent part and a regularization part:
(3)  $-\mathbb{E}_{(\mathbf{x}, y) \sim p_{\mathrm{data}}}\, \mathbb{E}_{q_{\boldsymbol{\theta}}(\mathbf{w})} \log p(y \mid \mathbf{x}, \mathbf{w}) \;+\; \mathrm{KL}\big[\, q_{\boldsymbol{\theta}}(\mathbf{w}) \,\big\|\, p(\mathbf{w}) \,\big]$
where $p_{\mathrm{data}}$ represents the data distribution. In the first term of objective (3), the probability of $y$ given $\mathbf{x}$ and weights $\mathbf{w}$ is the output of the model; this part represents the classification loss. The second term of objective (3) minimizes the divergence between the prior and the parametric distribution, which can be viewed as regularization (Blundell et al., 2015).
3. Proposed Method
We first discuss the motivation of the proposed method: 1) the distributions of the hidden layer neurons of a DNN can differ between adversarial and natural examples; 2) this dispersion is more obvious in BNN than in DNN; 3) it is easier to simulate the hidden layer output distribution with random components. We then introduce the specific metric used to measure this distributional difference and extend the detection method to multiple layers to make it more resistant to adversarial attacks.
3.1. Distributional difference of hidden layer outputs
Given input $\mathbf{x}$ and a classifier $f$, the prediction of the classifier is denoted by $C(\mathbf{x}) = \arg\max_i f(\mathbf{x})_i$; that is, the predicted label is the one with the highest prediction score. The adversary aims to perturb the original input to change the predicted label: $C(\mathbf{x} + \boldsymbol{\delta}) \neq C(\mathbf{x})$, where $\boldsymbol{\delta}$ denotes the perturbation added to the original input. The attacker aims to find a small $\boldsymbol{\delta}$ (usually lying within a small norm ball) that successfully changes the prediction of the model. Thus, given the same predicted label, there can be a distributional difference in hidden layer outputs between adversarial examples and natural data. For example, adversarial examples misclassified as airplanes could have hidden layer output distributions different from those of natural airplane images. Here, we define a hidden layer output distribution in DNN as the empirical distribution of all the neuron values of that layer, while in BNN it is a true distribution, as the weights of BNN are stochastic. The proposed method is motivated by the observation that the distribution of hidden layer outputs changes under adversarial perturbations (e.g., Figure 3).
Why BNN, not DNN? This pattern can be observed in both DNN and BNN. However, the distributional difference is more obvious in BNN than in a neural network without random components (see Figure 3). Therefore, more information can be extracted from BNN than from a general neural network. Furthermore, the random components of BNN make it easier to simulate the hidden layer output distributions. Our experimental results also show that the proposed detection method works better with BNN than with a general neural network on multiple datasets (see Section 4.2 for more details).
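To illustrate why a BNN yields a distribution of hidden features, the following minimal numpy sketch, in the style of Bayes by Backprop (Blundell et al., 2015), implements a mean-field Gaussian linear layer: each forward pass draws fresh weights, and the KL term of objective (3) has a closed form against a standard normal prior. This is an illustration, not the paper's implementation; the layer sizes, the softplus parameterization and the standard-normal prior are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesianLinear:
    """A mean-field Gaussian linear layer: each weight w_i ~ N(mu_i, sigma_i^2)."""

    def __init__(self, n_in, n_out):
        self.mu = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.rho = np.full((n_in, n_out), -3.0)  # sigma = softplus(rho) > 0

    @property
    def sigma(self):
        return np.log1p(np.exp(self.rho))  # softplus keeps sigma positive

    def forward(self, x):
        # Reparameterization: sample w = mu + sigma * eps with eps ~ N(0, I),
        # so every forward pass uses a fresh weight sample.
        eps = rng.standard_normal(self.mu.shape)
        w = self.mu + self.sigma * eps
        return x @ w

    def kl_to_standard_normal(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over weights --
        # the regularization term of the BNN objective.
        return 0.5 * np.sum(self.sigma**2 + self.mu**2 - 1.0 - 2.0 * np.log(self.sigma))

layer = BayesianLinear(4, 3)
x = rng.standard_normal((2, 4))
out1, out2 = layer.forward(x), layer.forward(x)
# Stochastic weights: two passes on the same input give different outputs,
# which is what lets us simulate a hidden-layer output distribution.
assert not np.allclose(out1, out2)
assert layer.kl_to_standard_normal() > 0.0
```

Repeating `layer.forward(x)` many times yields the sample of hidden outputs whose empirical distribution the detector compares against those of training data.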
Figure 3 shows the hidden layer output distributions of three representative hidden layers in DNN and BNN. Blue and cyan curves (train and test) represent distributions of the natural automobile images in CIFAR-10. Red curves represent the distributions of adversarial examples misclassified as automobiles. The adversarial examples are generated by PGD (Madry et al., 2017) under the $\ell_\infty$ norm. The DNN and BNN trained on CIFAR-10 use the VGG16 architecture (Simonyan and Zisserman, 2014), except that the weights in BNN follow Gaussian distributions. See details of attack parameters and network architectures in the Appendix.
We can see that for all three hidden layers, the distributional differences are more obvious in BNN than in DNN. In BNN, the hidden layer output distributions of the natural images (train or test) are clearly different from those of adversarial examples (adv), while the pattern is not obvious in DNN. Although only three layers are shown here, similar patterns are observed in other layers of BNN. This phenomenon is not specific to PGD adversarial examples on CIFAR-10: such characteristics are also found in adversarial examples generated by different attack methods on other datasets. Therefore, we can leverage the distributional differences across multiple layers to detect adversarial examples.
3.2. Detect adversarial examples by distribution distance
We propose to measure the dispersion between the hidden layer output distributions of adversarial examples and natural inputs and use this characteristic to detect adversarial examples. In particular, given an input $\mathbf{x}$ and its predicted label $y$, we measure the distribution distance between the hidden layer output distribution of $\mathbf{x}$ and the corresponding hidden layer output distributions of training samples from class $y$:
(4)  $s_i(\mathbf{x}, y) = \frac{1}{n_y} \sum_{j=1}^{n_y} \mathrm{div}\big( q_i(\mathbf{x}),\; q_i(\mathbf{x}_j^{y}) \big)$
where $q_i(\mathbf{x})$ represents the hidden layer output distribution of the $i$-th layer for testing sample $\mathbf{x}$, $q_i(\mathbf{x}_j^{y})$ represents the hidden layer output distribution of the $i$-th layer for the $j$-th training sample from class $y$, $n_y$ is the number of training samples in class $y$, and $\mathrm{div}(\cdot, \cdot)$ can be an arbitrary divergence. In this paper, we estimate the divergence with the 1-Wasserstein distance; however, other divergence measures, such as the Kullback–Leibler divergence, can also be used.
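The dispersion score can be sketched as follows, treating a layer's output as a 1-D sample and averaging the empirical 1-Wasserstein distance against a few training representatives of the predicted class. The helper names, grid size and Gaussian toy activations are illustrative assumptions, not the paper's code.

```python
import numpy as np

def w1_distance(a, b, grid=512):
    """Empirical 1-Wasserstein distance between two 1-D samples, computed as
    the average absolute difference of their quantile functions."""
    u = (np.arange(grid) + 0.5) / grid
    return np.abs(np.quantile(a, u) - np.quantile(b, u)).mean()

def dispersion_score(test_acts, train_acts_list):
    """Average W1 between the test sample's layer activations and those of a
    few training representatives of the predicted class (cf. Eq. (4))."""
    return float(np.mean([w1_distance(test_acts, t) for t in train_acts_list]))

rng = np.random.default_rng(0)
# Layer activations of 5 training representatives of the predicted class.
train_reps = [rng.normal(0.0, 1.0, 256) for _ in range(5)]
natural = rng.normal(0.0, 1.0, 256)   # in-distribution activations
shifted = rng.normal(1.5, 1.0, 256)   # activations shifted off the data manifold
# Off-manifold activations receive a larger dispersion score.
assert dispersion_score(shifted, train_reps) > dispersion_score(natural, train_reps)
```

For a BNN, `test_acts` would be gathered over multiple stochastic forward passes; for a DNN, a single pass provides the only available sample.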
The hidden layer output distribution of the training samples of each class can be easily simulated, since there are multiple samples in each class. However, given the predicted label, simulating the hidden layer output distribution of a testing sample is not easy. For a general deep neural network without random components, the hidden layer output is deterministic, so the simulation depends on only a single forward pass. For BNN, the hidden layer output is stochastic, so we can simulate the distribution with multiple forward passes of the input. To pool information from different levels, the dispersion is measured at multiple hidden layers to generate a set of dispersion scores $\{ s_i : i \in S \}$, where $S$ is the index set of selected hidden layers.
It is expected that natural inputs will have small dispersion scores while adversarial examples will have relatively large dispersion scores. A binary classifier is trained on the dispersion scores to detect adversarial examples. In this paper, we fit a binomial logistic regression model to perform the binary classification. Details of the method are included in Algorithm 1.
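The final detection step can be sketched as stacking per-layer dispersion scores into feature vectors and fitting a logistic-regression detector on them. The paper fits a binomial logistic regression; this plain gradient-descent version and the synthetic score distributions are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit a logistic-regression detector on dispersion-score vectors X
    (n_samples x n_layers) with labels y (1 = adversarial, 0 = natural)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        z = np.clip(X1 @ w, -30, 30)            # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X1.T @ (p - y) / len(y)       # gradient of the log-loss
    return w

def is_adversarial(X, w):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-np.clip(X1 @ w, -30, 30))) > 0.5

rng = np.random.default_rng(0)
nat = rng.normal(0.1, 0.05, (50, 3))   # natural inputs: small dispersion scores
adv = rng.normal(1.0, 0.20, (50, 3))   # adversarial inputs: large dispersion scores
X = np.vstack([nat, adv])
y = np.r_[np.zeros(50), np.ones(50)]
w = fit_logistic(X, y)
assert np.mean(is_adversarial(X, w) == y) > 0.95
```

Because the scores for natural and adversarial inputs are well separated, even this simple linear detector suffices; any binary classifier could be substituted.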
3.3. Implementation Details
Layer Selection
For adversarial examples generated with different attacks on different datasets, the pattern of distributional differences can vary. For example, adversarial examples generated by PGD on CIFAR-10 show larger distributional dispersion in deeper layers (layers closer to the final layer). However, this characteristic does not appear in adversarial examples generated by CW on CIFAR-10; instead, the distributional dispersion is more obvious in earlier layers (layers closer to the input layer). Therefore, we develop an automated hidden layer selection scheme to find the layers with the largest deviation between natural data and adversarial examples. Cross-validation is performed for layer selection by fitting a binary classifier (logistic regression) on a single layer’s dispersion score. Layers with top-ranked performance measured by AUC (area under the receiver operating characteristic curve) scores are selected, and information from those layers is pooled for the final detection (see details of selected layers in the Appendix).
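The layer-selection scheme can be sketched as ranking layers by the AUC achievable from a single layer's dispersion score. For brevity this sketch computes the AUC directly via the Mann-Whitney statistic rather than cross-validating a logistic regression as the paper does; the array shapes and synthetic scores are illustrative.

```python
import numpy as np

def auc_score(neg, pos):
    """AUC as the probability that a random adversarial (pos) score exceeds a
    random natural (neg) score, counting ties as one half."""
    neg, pos = np.asarray(neg), np.asarray(pos)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def select_layers(nat_scores, adv_scores, k=2):
    """Rank layers by single-layer AUC of their dispersion scores and keep the
    top k; inputs are (n_samples, n_layers) score arrays."""
    n_layers = nat_scores.shape[1]
    aucs = np.array([auc_score(nat_scores[:, i], adv_scores[:, i])
                     for i in range(n_layers)])
    return np.argsort(aucs)[::-1][:k], aucs

rng = np.random.default_rng(0)
nat = rng.normal(0.0, 1.0, (100, 5))
adv = rng.normal(0.0, 1.0, (100, 5))
adv[:, 2] += 3.0    # layer 2 separates natural from adversarial strongly
adv[:, 4] += 2.0    # layer 4 separates moderately
top, aucs = select_layers(nat, adv, k=2)
assert set(top) == {2, 4}
```

Only the selected layers' scores are then passed to the pooled detector, which keeps the feature vector small and attack-specific.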
Distance Calculation
To measure the dispersion between the hidden layer output distributions of natural and adversarial samples, we treat the output of a hidden layer as a realization of a one-dimensional random variable. The dispersion between two distributions is estimated by the 1-Wasserstein distance between their empirical distributions. In BNN, the empirical distribution of a test sample can be simulated by multiple forward passes, whereas in DNN a single forward pass is used, as the output is deterministic. Training samples from the same class can be used to simulate the empirical hidden layer output distributions of natural data of that class. Given a testing sample and its predicted label, calculating the dispersion score with all training samples in the predicted class is expensive, so we sample some natural images in the predicted class as representatives to speed up the process.
Dimension Reduction
To further improve computational efficiency, we apply dimension reduction to the hidden layer outputs. PCA (principal component analysis) is fit on the hidden layer outputs of training samples before the testing stage. At the testing stage, the hidden layer output is projected to a lower dimension before calculating dispersion scores, which speeds up the dispersion score calculation for high-dimensional outputs.
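The dimension-reduction step can be sketched with an SVD-based PCA fit on training-time activations; the component count and array shapes are illustrative assumptions.

```python
import numpy as np

def fit_pca(train_acts, n_components=8):
    """Fit PCA via SVD on training activations (n_samples x n_units); returns
    the mean and the top principal directions."""
    mean = train_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(train_acts - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(acts, mean, components):
    """Project activations onto the fitted principal directions."""
    return (acts - mean) @ components.T

rng = np.random.default_rng(0)
train_acts = rng.standard_normal((200, 64))   # training-time layer outputs
mean, comps = fit_pca(train_acts, n_components=8)
low = project(train_acts, mean, comps)
assert low.shape == (200, 8)
# Projected variances are non-increasing: the first component captures most.
var = low.var(axis=0)
assert np.all(var[:-1] >= var[1:] - 1e-9)
```

The fitted `mean` and `comps` are reused at test time, so only one cheap matrix product is added per forward pass before the Wasserstein computation.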
4. Experimental Results
We evaluated BATector on the following well-known image classification datasets:
- MNIST (LeCun, 1998): handwritten digit dataset, which consists of 60,000 training images and 10,000 testing images. These are 28×28 black-and-white images in ten different classes.
- CIFAR-10 (Krizhevsky and Hinton, 2009): natural image dataset, which contains 50,000 training images and 10,000 testing images in ten different classes. These are low-resolution 32×32 color images.
The training sets provided by the datasets are used to train the BNN and DNN. The BNN and DNN architectures are the same except that the weights of BNN follow Gaussian distributions. The test sets are split into training folds and test folds. The detection models (binary classifiers) of KD, LID and BATector are trained on the training folds, and the test folds are used to evaluate the performance of the different detection methods. Foolbox (Rauber et al., 2017) is used to generate adversarial examples with the following attack methods: FGSM (Goodfellow et al., 2015) and PGD (Madry et al., 2017) under the $\ell_\infty$ norm, and CW (Carlini and Wagner, 2017a) under the $\ell_2$ norm. Since BNN is stochastic, the original PGD and CW attacks, which do not account for randomness, are not strong enough against it. For fair comparison, we update the PGD and CW loss functions with stochastic optimization methods. Details on parameters, neural network architectures and implementation are provided in the Appendix.
4.1. Comparison with State-of-the-art Methods
We compare the performance of BATector with the following state-of-the-art adversarial detection methods: 1) Kernel Density detection (KD) (Feinman et al., 2017); 2) Local Intrinsic Dimensionality detection (LID) (Ma et al., 2018); 3) Odds are odd detection (ODD) (Roth et al., 2019); 4) Reading Between the Layers (ReBeL) (Raghuram et al., 2020). In (Raghuram et al., 2020), ReBeL outperforms deep Mahalanobis detection (Lee et al., 2018) and trust score (Jiang et al., 2018), so we do not include their performance here. Details of implementation and parameters can be found in the Appendix.
Data  Metric  CW
KD  LID  ODD  ReBeL  BATector  
CIFAR10  AUC  0.945  0.947  0.955  0.968  0.980 
TPR(FPR@0.01)  0.068  0.220  0.591  0.309  0.606  
TPR(FPR@0.05)  0.464  0.668  0.839  0.726  0.881  
TPR(FPR@0.10)  0.911  0.856  0.901  0.954  0.965  
MNIST  AUC  0.932  0.785  0.968  0.980  0.999 
TPR(FPR@0.01)  0.196  0.079  0.212  0.630  0.974  
TPR(FPR@0.05)  0.616  0.263  0.911  0.900  0.997  
TPR(FPR@0.10)  0.818  0.397  1.000  0.972  1.000  
Imagenetsub  AUC  0.811  0.905  0.886  0.834  0.941 
TPR(FPR@0.01)  0.193  0.401  0.185  0.035  0.146  
TPR(FPR@0.05)  0.452  0.653  0.398  0.167  0.538  
TPR(FPR@0.10)  0.584  0.754  0.566  0.312  0.815 
Data  Metric  FGSM  
KD  LID  ODD  ReBeL  BATector  
CIFAR10  AUC  0.873  0.957  0.968  0.990  0.995 
TPR(FPR@0.01)  0.136  0.385  0.224  0.698  0.878  
TPR(FPR@0.05)  0.401  0.753  0.709  0.974  0.991  
TPR(FPR@0.10)  0.572  0.875  1.000  1.000  0.998  
MNIST  AUC  0.933  0.888  0.952  0.992  0.999 
TPR(FPR@0.01)  0.421  0.152  0.898  0.885  0.972  
TPR(FPR@0.05)  0.692  0.503  0.908  0.990  0.998  
TPR(FPR@0.10)  0.796  0.678  0.917  1.000  1.000  
Imagenetsub  AUC  0.914  0.983  0.844  0.842  0.989 
TPR(FPR@0.01)  0.460  0.772  0.042  0.045  0.569  
TPR(FPR@0.05)  0.727  0.952  0.188  0.197  0.989  
TPR(FPR@0.10)  0.822  0.987  0.364  0.358  1.000 
Data  Metric  PGD  
KD  LID  ODD  ReBeL  BATector  
CIFAR10  AUC  0.791  0.777  0.963  0.962  0.971 
TPR(FPR@0.01)  0.018  0.093  0.059  0.191  0.813  
TPR(FPR@0.05)  0.148  0.317  0.819  0.789  0.881  
TPR(FPR@0.10)  0.285  0.448  0.999  0.999  0.917  
MNIST  AUC  0.801  0.861  0.967  0.975  0.989 
TPR(FPR@0.01)  0.062  0.170  0.607  0.382  0.733  
TPR(FPR@0.05)  0.275  0.396  0.934  0.851  0.957  
TPR(FPR@0.10)  0.429  0.552  0.945  0.956  0.999  
Imagenetsub  AUC  0.989  0.991  0.777  0.824  0.976 
TPR(FPR@0.01)  0.930  0.829  0.010  0.028  0.729  
TPR(FPR@0.05)  0.966  0.961  0.054  0.139  0.904  
TPR(FPR@0.10)  0.979  0.984  0.121  0.280  0.947 
We report the area under the ROC curve (AUC) as the performance evaluation criterion, as well as the true positive rates (TPR) obtained by thresholding the false positive rate (FPR) at 0.01, 0.05 and 0.1, since in practice it is desirable to keep the proportion of misclassified natural data low. TPR represents the proportion of adversarial examples classified as adversarial, and FPR represents the proportion of natural data misclassified as adversarial. Before calculating the performance metrics, all adversarial candidates that the model still classifies correctly (i.e., failed attacks) are removed. The results are reported in Table 3, and the ROC curves are shown in Figure 4. BATector shows superior or comparable performance relative to the other four detection methods across the three datasets.
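The TPR-at-fixed-FPR metric used in the tables amounts to thresholding detector scores at the appropriate quantile of the natural-data scores; a small sketch with synthetic score distributions:

```python
import numpy as np

def tpr_at_fpr(nat_scores, adv_scores, fpr=0.05):
    """Pick the detection threshold so that at most `fpr` of natural data is
    flagged, then report the fraction of adversarial examples flagged."""
    thresh = np.quantile(nat_scores, 1.0 - fpr)
    return float(np.mean(np.asarray(adv_scores) > thresh))

rng = np.random.default_rng(0)
nat = rng.normal(0.0, 1.0, 1000)   # detector scores on natural inputs
adv = rng.normal(3.0, 1.0, 1000)   # detector scores on adversarial inputs
# Loosening the FPR budget can only lower the threshold, so TPR is monotone.
assert tpr_at_fpr(nat, adv, fpr=0.05) >= tpr_at_fpr(nat, adv, fpr=0.01)
assert tpr_at_fpr(nat, adv, fpr=0.05) > 0.8
```

Sweeping `fpr` over a grid and plotting the resulting (FPR, TPR) pairs traces out the ROC curves of Figure 4.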
4.2. Ablation Study: BNN versus DNN
In this section, we compare the performance of BATector using different structures (BNN versus DNN) against PGD across the three datasets. The detection methods are the same (as described in Algorithm 1); the differences are: 1) BATector with DNN uses a pretrained deep neural network of the same structure without random weights; 2) the number of forward passes is one, as DNN does not produce different outputs for the same input. We report the class-conditional AUC of the two structures across the three datasets. The comparison results on CIFAR-10 and MNIST are shown in Table 4, and the results on ImageNet-sub are shown in Figure 5. Since ImageNet-sub contains many classes, it is not practical to show the results in a table; instead, we show the AUC histograms of BATector with the two structures in Figure 5. Comparing the AUC of BATector with BNN and with DNN on CIFAR-10 and MNIST, the BNN structure demonstrates consistently superior performance. On ImageNet-sub, the AUC histogram of BATector with BNN is concentrated near its upper range with a left tail, while the histogram of BATector with DNN centers around substantially lower values, so the BNN structure clearly outperforms on ImageNet-sub as well.
Class  CIFAR10  MNIST  

BNN  DNN  BNN  DNN  
class1  0.978  0.489  0.929  0.901 
class2  0.972  0.410  1.000  0.967 
class3  0.973  0.501  0.993  0.892 
class4  0.994  0.594  0.991  0.958 
class5  0.955  0.477  1.000  0.883 
class6  0.995  0.729  0.999  0.937 
class7  0.976  0.584  0.989  0.878 
class8  0.973  0.537  1.000  0.941 
class9  0.915  0.493  0.959  0.874 
class10  0.949  0.567  0.982  0.917 
4.3. Highconfidence Attack and Customized Attack
In (Athalye et al., 2018), the authors pointed out that detection methods can fail when the confidence level of adversarial examples generated by the CW attack increases. Therefore, we also test BATector against high-confidence adversarial examples across the three datasets. Adversarial examples generated by CW with confidence levels 10 and 20 are used in the experiments. The performance of BATector is reported in Table 5. The results show that BATector performs well against high-confidence adversarial examples. Our analysis finds that the hidden layers selected to perform detection differ when defending against high-confidence adversarial examples. Increasing the confidence level may change the characteristics of adversarial examples, but with proper layer selection, BATector can still detect them.
Data  Metric  CW (Confidence)  PGD  PGD_RES  
0  10  20  
CIFAR10  AUC  0.980  0.999  0.995  0.971  0.893 
TPR(FPR@0.01)  0.606  0.998  0.939  0.813  0.002  
TPR(FPR@0.05)  0.881  1.000  0.995  0.881  0.321  
TPR(FPR@0.10)  0.965  1.000  0.995  0.917  0.606  
MNIST  AUC  0.999  0.995  0.995  0.989  0.945 
TPR(FPR@0.01)  0.974  0.913  0.919  0.733  0.519  
TPR(FPR@0.05)  0.997  0.993  0.994  0.957  0.739  
TPR(FPR@0.10)  1.000  0.998  0.999  0.999  0.851  
Imagenetsub  AUC  0.928  0.991  0.983  0.976  0.915 
TPR(FPR@0.01)  0.146  0.896  0.642  0.729  0.221  
TPR(FPR@0.05)  0.538  0.951  0.910  0.904  0.607  
TPR(FPR@0.10)  0.815  0.977  0.964  0.947  0.785 
All the previous experiments are carried out in a gray-box setting, where we assume the adversary has access to the classifier but does not know the details of the detector. The white-box setting assumes that the adversary has access to both the classifier and the detector. How to attack a detection method under this setting is worth studying, as it can reveal possible drawbacks of the method and suggest future research directions. In this section, we develop a new attack method customized for BATector, called Restricted-PGD. Attacking both the classifier and the detector of BATector is hard, because the detection process involves a distribution simulation step and a Wasserstein distance estimation step, which are difficult to differentiate through. Therefore, directly modifying the objective function of a gradient-based attack method does not work here.
However, carefully observing the hidden layer output distributions of natural data and adversarial examples reveals an interesting pattern: the hidden layer output standard deviations of natural data and those of PGD adversarial examples are very different, which might be the key characteristic leveraged by the detection method to defend against the PGD attack. In Figure 6, we show the distributions of the hidden layer output standard deviations for natural data and for PGD adversarial examples. The hidden layer output standard deviations of adversarial examples are clearly larger than those of natural data, which makes the hidden layer output distributions of natural data and adversarial examples differ.
Therefore, we propose an updated PGD attack that restricts the differences between the predicted scores of different classes in the final layer, which decreases the standard deviation of the hidden layer outputs, especially for layers close to the final layer. Given input $\mathbf{x}$ and its true label $y$, the objective function of Restricted-PGD is given by:
(5)  $\max_{\|\boldsymbol{\delta}\|_\infty \le \epsilon} \ \ell\big(f(\mathbf{x} + \boldsymbol{\delta}), y\big) \;-\; \lambda \sum_{j < k} \big| f(\mathbf{x} + \boldsymbol{\delta})_j - f(\mathbf{x} + \boldsymbol{\delta})_k \big|$
where $\ell$ is the classification loss, $\epsilon$ is the maximum perturbation and $\lambda$ controls the weight of the restriction term. One drawback of Restricted-PGD is that it may not be able to decrease the output standard deviations of earlier layers. For fair evaluation under the white-box setting, only the last several layers are used by BATector to perform adversarial detection. Besides, the parameter $\lambda$ is carefully tuned to make sure the attack is strong. The performance of BATector against Restricted-PGD across the three datasets is shown in Table 5. Though affected by Restricted-PGD, BATector still performs well on MNIST and ImageNet-sub, and reasonably on CIFAR-10.
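To make the restricted objective concrete, the following sketch runs a Restricted-PGD-style attack against a toy linear softmax classifier, for which the gradients of both the cross-entropy term and the score-spread penalty are available in closed form. The linear model, step sizes and $\lambda$ value are illustrative assumptions; the paper attacks deep networks.

```python
import numpy as np

def restricted_pgd(x, y, W, b, eps=0.3, alpha=0.05, steps=40, lam=0.1):
    """Restricted-PGD on a linear softmax classifier f(x) = Wx + b: ascend the
    cross-entropy loss while penalizing pairwise differences between class
    scores, shrinking the spread of final-layer outputs (cf. Eq. (5))."""
    x_adv = x.copy()
    for _ in range(steps):
        f = W @ x_adv + b
        p = np.exp(f - f.max()); p /= p.sum()           # softmax probabilities
        onehot = np.zeros_like(f); onehot[y] = 1.0
        grad_ce = W.T @ (p - onehot)                    # d CE / d x
        # d/df_j of sum_{j<k} |f_j - f_k| is sum_{k != j} sign(f_j - f_k)
        grad_pen = W.T @ np.sign(f[:, None] - f[None, :]).sum(axis=1)
        x_adv = x_adv + alpha * np.sign(grad_ce - lam * grad_pen)
        x_adv = np.clip(x_adv, x - eps, x + eps)        # project onto the eps-ball
    return x_adv

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(5)
y = int(np.argmax(W @ x + b))          # start from a correctly labeled point
x_adv = restricted_pgd(x, y, W, b)
assert np.all(np.abs(x_adv - x) <= 0.3 + 1e-12)  # stays inside the eps-ball
assert not np.allclose(x_adv, x)                 # the input was perturbed
```

Setting `lam=0` recovers plain ℓ∞ PGD; increasing `lam` trades attack strength for a smaller score spread, which is what makes the resulting examples harder for the detector to flag.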
The hidden layer characteristics of adversarial examples generated by other attacks could be different from those of PGD. Therefore, the same trick may not work for FGSM and CW. Updated attack methods can be developed by carefully investigating the characteristics, and it could be a future research direction.
5. Conclusion
In this paper, we introduce a new framework to detect adversarial examples with Bayesian neural networks by capturing the distributional differences of multiple hidden layer outputs between natural and adversarial examples. We show that our detection framework outperforms other state-of-the-art methods in detecting adversarial examples generated by various kinds of attacks. It also displays strong performance in detecting high-confidence adversarial examples and adversarial examples generated by a customized attack method.
References
 Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), Cited by: §1, §2, §2, §4.3.
 Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §2, §2.
 Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §2.

 Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: 1st item, §2, §4.
 Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39–57. Cited by: §2.
 HopSkipJumpAttack: a query-efficient decision-based attack. arXiv preprint arXiv:1904.02144. Cited by: §2.
 EAD: elastic-net attacks to deep neural networks via adversarial examples. In AAAI, Cited by: §2.
 ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §2.
 Signopt: a queryefficient hardlabel adversarial attack. arXiv preprint arXiv:1909.10773. Cited by: §2.
 Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918. Cited by: §1.
 Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442. Cited by: §1, §2.
 Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: 1st item, §1, §2, §4.1.
 Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960. Cited by: §2.
 Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: 2nd item, §1, §4.
 Blackbox adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598. Cited by: §2.
 The robust manifold defense: adversarial training using generative models. arXiv preprint arXiv:1712.09196. Cited by: §2.
 To trust or not to trust a classifier.. In NeurIPS, pp. 5546–5557. Cited by: §4.1.
 Learning multiple layers of features from tiny images. Technical report Citeseer. External Links: hepph/9609357 Cited by: Table 6, Table 7, 2nd item.
 Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. External Links: hepph/9609357 Cited by: §2.
 The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. External Links: hepph/9609357 Cited by: Table 6, Table 7, 1st item.
 A simple unified framework for detecting outofdistribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §1, §2, §4.1.
 Optimal transport classifier: defending against adversarial attacks by regularized deep embedding. arXiv preprint arXiv:1811.07950. Cited by: §2.
 Towards robust neural networks via random selfensemble. arXiv preprint arXiv:1712.00673. External Links: hepph/9609357 Cited by: §1.
 Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613. Cited by: 2nd item, §1, §1, §2, §2, §4.1.

Towards deep learning models resistant to adversarial attacks
. arXiv preprint arXiv:1706.06083. External Links: hepph/9609357 Cited by: 3rd item, §1, §2, §2, Figure 3, §3.1, Figure 6, §4.  Magnet: a twopronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147. External Links: hepph/9609357 Cited by: §2.

Spectral normalization for generative adversarial networks
. arXiv preprint arXiv:1802.05957. External Links: hepph/9609357 Cited by: Table 6, Table 7, 3rd item.  Universal adversarial perturbations. arXiv preprint. External Links: hepph/9609357 Cited by: §2.

Deepfool: a simple and accurate method to fool deep neural networks.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 2574–2582. External Links: hepph/9609357 Cited by: §2.  Towards robust detection of adversarial examples. In Advances in Neural Information Processing Systems, pp. 4584–4594. Cited by: §1.
 Detecting anomalous inputs to dnn classifiers by joint statistical testing at the layers. arXiv preprint arXiv:2007.15147. Cited by: 4th item, §2, §4.1.
 Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: Appendix B, §4.
 The odds are odd: a statistical test for detecting adversarial examples. In International Conference on Machine Learning, pp. 5498–5507. Cited by: 3rd item, §2, §4.1.
 Defensegan: protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605. External Links: hepph/9609357 Cited by: §1, §2.
 Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 6, Table 7, §3.1.
 Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §1, §2.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. External Links: hepph/9609357 Cited by: §1.
 A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690. Cited by: §1.
 Attacks meet interpretability: attributesteered detection of adversarial samples. In Advances in Neural Information Processing Systems, pp. 7717–7728. Cited by: §1.
 Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §2.
 Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295. Cited by: §2.
 Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991. External Links: hepph/9609357 Cited by: §1, §1, §2.
 Subspace attack: exploiting promising subspaces for queryefficient blackbox attacks. arXiv preprint arXiv:1906.04392. Cited by: §2.
 Mlloo: detecting adversarial examples with feature attribution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 6639–6647. Cited by: §1, §2.
 Adversarial examples: attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.
 Theoretically principled tradeoff between robustness and accuracy. arXiv preprint arXiv:1901.08573. Cited by: §2.
 Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. In Advances in Neural Information Processing Systems, pp. 7913–7922. Cited by: §1.
Appendix A Datasets and DNN Architectures
A summary of the datasets, DNN architectures, and test set performance of the corresponding DNNs is given in Table 6. The BNNs use the same architectures on each dataset, except that their weights are not deterministic but follow Gaussian distributions. A summary of the BNN architectures and their test set performance is given in Table 7.
Dataset  Number of Classes  Test Accuracy (%)  Architecture

MNIST (LeCun, 1998)  10  99.20  2 Conv. + 2 FC layers
CIFAR-10 (Krizhevsky and Hinton, 2009)  10  93.34  VGG16 with BN (Simonyan and Zisserman, 2014)
ImagenetSub (Miyato et al., 2018)  143  65.05  VGG16 with BN (Simonyan and Zisserman, 2014)
Table 6: Datasets and DNN architectures. Conv.: convolutional layer, FC: fully connected layer, BN: batch normalization.
Dataset  Number of Classes  Test Accuracy (%)  Architecture

MNIST (LeCun, 1998)  10  98.81  2 Conv. + 2 FC layers
CIFAR-10 (Krizhevsky and Hinton, 2009)  10  92.32  VGG16 with BN (Simonyan and Zisserman, 2014)
ImagenetSub (Miyato et al., 2018)  143  61.50  VGG16 with BN (Simonyan and Zisserman, 2014)
Table 7: BNN architectures and test set performance.
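As noted above, each BNN shares its architecture with the corresponding DNN but has weights that follow Gaussian distributions rather than fixed values. A minimal NumPy sketch of such a layer, using the reparameterization trick (class and variable names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesianLinear:
    """Fully connected layer whose weights follow factorized Gaussians.

    Each forward pass draws a fresh weight sample W = mu + sigma * eps
    (the reparameterization trick), so repeated passes over the same
    input yield a distribution of hidden-layer outputs.
    """

    def __init__(self, n_in, n_out):
        self.w_mu = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.w_rho = np.full((n_in, n_out), -3.0)  # sigma = softplus(rho)

    def forward(self, x):
        sigma = np.log1p(np.exp(self.w_rho))   # softplus keeps sigma > 0
        eps = rng.standard_normal(self.w_mu.shape)
        w = self.w_mu + sigma * eps            # sampled weights
        return x @ w

layer = BayesianLinear(4, 2)
x = np.ones((1, 4))
# Four forward passes give four samples from the hidden-layer output
# distribution, mirroring the 4 passes used per layer in Appendix C.
samples = np.stack([layer.forward(x) for _ in range(4)])
print(samples.shape)  # (4, 1, 2)
```

Because the sampled weights differ on every pass, the four outputs differ; it is this spread, rather than a single point estimate, that the detector exploits.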
Appendix B Parameters of Attack Methods
Foolbox (Rauber et al., 2017) is used to generate adversarial examples with CW, FGSM, and PGD. The parameters used for each attack method are listed below:

FGSM (Goodfellow et al., 2015) with norm: the maximum perturbation values of the FGSM attack on MNIST, CIFAR-10, and ImagenetSub are set to , , and , respectively.

PGD (Madry et al., 2017) with norm: the values of the PGD attack on MNIST, CIFAR-10, and ImagenetSub are set to , , and , respectively. The numbers of iterations are set to , , and , respectively, and the step sizes are set to , , and , respectively.
For the Restricted-PGD attack, the parameter values are the same as for PGD on all three datasets. For the high-confidence CW attack, the confidence is set to and on all three datasets, and the maximum number of iterations is set to .
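For reference, the L∞ PGD update that these parameters control can be sketched in plain NumPy on a toy differentiable loss (the loss and the numeric values here are illustrative only, not the settings used in the experiments; the paper itself generates attacks through Foolbox):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, step_size, n_iter):
    """L-infinity PGD: repeatedly ascend the loss via signed gradient
    steps, then project back into the eps-ball around the original
    input and clip to the valid pixel range [0, 1]."""
    x0 = x.copy()
    x_adv = x.copy()
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        x_adv = x_adv + step_size * np.sign(g)      # FGSM-style step
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)  # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # valid image range
    return x_adv

# Toy example: loss = sum(x), so the gradient is all ones and PGD
# pushes every pixel up until it hits the eps boundary.
x = np.full(5, 0.5)
x_adv = pgd_linf(x, lambda z: np.ones_like(z), eps=0.1,
                 step_size=0.02, n_iter=10)
print(x_adv)  # every pixel moved from 0.5 to 0.6
```

FGSM is the single-step special case of this loop (n_iter = 1 with step size equal to eps), which is why the two attacks share the maximum-perturbation parameter.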
Appendix C Layer Representations
Data  CW  FGSM  PGD  

Layers  Statistic  Layers  Statistic  Layers  Statistic  
MNIST  [3, 5, 6]  min  [3, 5, 6]  min  [3, 5, 6]  min 
CIFAR10  [4, 7]  min  [7]  mean  [39, 42, 43]  mean 
ImagenetSub  [3, 7]  min  [7]  min  [39, 42, 43]  mean 
A summary of the selected layers is given in Table 8. The distribution of each hidden layer in the BNN is simulated with 4 forward passes of the inputs. We also list the summary statistics we use as a measure of the distance from the test sample to the natural image sets.
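One plausible reading of the per-layer "min"/"mean" statistics in Table 8 is sketched below: distances are computed from each of the 4 forward-pass outputs of the test sample to the layer outputs of a natural reference set, then aggregated. This is an illustrative NumPy sketch under that assumption (Euclidean distance and the aggregation over passes are our choices, not confirmed details of the paper):

```python
import numpy as np

def layer_statistic(test_feats, natural_feats, stat="min"):
    """Summarize how far a test sample's hidden-layer outputs lie from
    those of natural images.

    test_feats:    (n_pass, d)  layer outputs from n_pass BNN forward passes
    natural_feats: (n_nat, d)   layer outputs of natural reference images
    """
    # Euclidean distance from every forward pass to every natural image,
    # via broadcasting: (n_pass, 1, d) - (1, n_nat, d) -> (n_pass, n_nat).
    dists = np.linalg.norm(
        test_feats[:, None, :] - natural_feats[None, :, :], axis=-1
    )
    return dists.min() if stat == "min" else dists.mean()

rng = np.random.default_rng(1)
test_feats = rng.normal(5.0, 0.1, size=(4, 8))       # 4 forward passes
natural_feats = rng.normal(0.0, 0.1, size=(100, 8))  # natural reference set
s_min = layer_statistic(test_feats, natural_feats, stat="min")
s_mean = layer_statistic(test_feats, natural_feats, stat="mean")
print(round(s_min, 2), round(s_mean, 2))
```

A sample far from the natural set (as in this synthetic example) yields large scores under either statistic; "min" is the more conservative of the two, since it credits the single closest natural image.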
Appendix D Implementation of Detections
Implementation details of the detection methods KD, LID, ODD, ReBeL, and BATector are as follows:

KD (Feinman et al., 2017): The Kernel Density detection method is implemented by converting the authors' implementation, available at https://github.com/rfeinman/detectingadversarialsamples, to PyTorch. The default parameter values are used for the experiments in this paper.

LID (Ma et al., 2018): The Local Intrinsic Dimensionality detection method is implemented using the code available at https://github.com/pokaxpoka/deep_Mahalanobis_detector. The numbers of neighbors used to calculate the local intrinsic dimensionality are and .

ODD (Roth et al., 2019): Odds are Odd is implemented using the authors' original implementation, available at https://github.com/yk/icml19_public. The default parameter values are used for the experiments in this paper.

ReBeL (Raghuram et al., 2020): Reading Between the Layers is implemented using the authors' original implementation, available at https://github.com/jayaramr/adversarialdetection. The test statistic is multinomial, the scoring method is p-value, and Fisher's method is used to combine the p-values from different layers. The default values are used for all other parameters.

BATector: Code is available at https://github.com/BayesianDetection/BayesianDetection.
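For context on the LID baseline above, the quantity it thresholds is the maximum-likelihood local intrinsic dimensionality estimate used by Ma et al. (2018). The sketch below shows the estimator itself on synthetic data (the reference set and neighborhood size are illustrative), not the full detector pipeline:

```python
import numpy as np

def lid_mle(x, reference, k=10):
    """Maximum-likelihood LID estimate:
    LID(x) = -( (1/k) * sum_i log(r_i / r_k) )^(-1),
    where r_1 <= ... <= r_k are the distances from x to its k nearest
    neighbors in the reference batch."""
    dists = np.linalg.norm(reference - x, axis=1)
    r = np.sort(dists)[:k]        # k smallest neighbor distances
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(2)
# Points spread over a 3-dimensional cube: the estimate should land
# near the true intrinsic dimension of 3.
reference = rng.uniform(size=(5000, 3))
est = lid_mle(rng.uniform(size=3), reference, k=50)
print(round(est, 2))
```

Adversarial examples tend to receive higher LID estimates than natural inputs at the hidden layers, which is the signal the LID detector feeds into its classifier.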