Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy on many classification problems, including speech recognition and image classification. However, recent studies have shown that DNNs lack the ability to detect abnormal input samples and, as a result, output incorrect predictions for them with high confidence. The ability to detect abnormal samples is essential when deploying DNNs in real-world applications.
In this paper, we consider two types of abnormal input: adversarial samples and out-of-distribution samples. Adversarial samples are generated by intentionally perturbing legitimate samples with the goal of inducing errant predictions in the classification model. Adversarial perturbations are usually small-scale and invisible to human observers. Out-of-distribution (OOD) samples are anomalous samples not belonging to the training data distribution. For example, an image of a flower would be considered OOD for a classification model trained on bird images. Significantly, DNNs incorrectly classify OOD samples (from unknown classes) into known classes, and do so with high confidence.
Several methods have been developed to address abnormal inputs. In one approach, extra modules are trained and used alongside classification models exclusively for the detection and rejection of abnormal inputs. Other methods add abnormal samples (with corrected labels) to standard training sets, and fine-tune classification models on the augmented training set to obtain the correct results. However, these methods are computation- and data-intensive in practice. Furthermore, detection methods for OOD and adversarial samples are usually developed separately, increasing the burden of deployment. In this paper, we show that the statistics of feature embeddings in classification models are effective for detecting both adversarial and OOD samples in a lightweight and unified manner.
We observe that abnormal samples, both adversarial and OOD, belong to distributions distinct from the distribution of legitimate samples (i.e., the distribution of the training set). Thus, detecting abnormal samples translates to measuring how far the input samples’ distribution deviates from the legitimate distribution.
Integral probability metrics (IPMs) are a family of probability distance measures that are useful in this context. The idea of IPMs is simple: 1) project two sets of samples from two distributions to a new space using a mapping function, and 2) compare the statistics of the two sets in the projected space. If the statistics are similar, the two sets likely belong to the same distribution. IPMs include various popular distance measures, such as the Wasserstein distance and maximum mean discrepancy (MMD). Previous studies have shown some preliminary results of adversarial detection using MMD. However, they conclude that MMD is not a suitable metric: it only performs well for detecting simple attacks (e.g., FGSM) on simple datasets (e.g., MNIST) and requires large batch sizes [5, 14].
We improve MMD by replacing a non-parametric mapping function with a neural network, namely the original pre-trained classification model. The proposed novel metric, Channel Mean Discrepancy (CMD), can be computed in a straightforward and efficient manner. CMDs for a given batch of samples are computed by channel-wise comparison of feature means and running means. Feature means are derived from the channel-wise average of feature embeddings extracted by the classification model. Running means are retrieved from batch normalization layers and are reasonable estimates of statistics over legitimate distributions. We empirically show that the arithmetic sum of CMDs over all channels is significantly smaller for legitimate batches than for abnormal batches. Compared to previous works, the minimum batch size for detection is reduced from 50 to 4. To achieve single sample detection (i.e., a batch size of 1), we tune channel coefficients for the CMDs before performing a weighted sum according to each channel’s sensitivity to abnormal samples.
We extensively evaluate our proposed single sample detection method across various models, datasets, adversarial attacks, and OOD datasets. Our method consistently outperforms other methods and demonstrates high generalizability across attacks and OOD datasets.
In summary, the advantages of the novel metric CMD are:
CMD is model-agnostic and may be applied to pre-trained classification models without modifying model architecture or performing expensive fine-tuning.
For small batch sizes, the arithmetic sum of CMDs is already effective for measuring how far an incoming batch deviates from a legitimate distribution. Single sample detection is achieved by using a weighted sum of CMDs with tuned channel sensitivity coefficients. In addition, channel sensitivities tuned for one type of abnormal data generalize well to other types of data.
Computing CMDs for a given batch is trivial, requiring only a single forward pass on the classification model being used.
2 Related Work
Previous work on detecting OOD and adversarial samples can be categorized into two classes. One class requires modifying model architectures or fine-tuning models. The other class requires only a pre-trained model.
Approaches requiring modifying model architectures or fine-tuning models.
To detect OOD samples, Gal et al. propose using an ensemble of sub-networks to distinguish OOD samples by estimating the predictive uncertainty of each sample. Recent work [24, 7, 18] improves the ability to detect OOD samples by fine-tuning models with additional samples crafted using a GAN or a novel loss function.
To detect adversarial samples, recent work [14, 11, 26] proposes reconstructing models or training a secondary DNN from scratch for the express purpose of identifying adversarial samples. The most recent work proposes retraining a DNN model using a reverse cross-entropy loss function and an embedded K-density detector.
Approaches requiring only a pre-trained model.
A body of work discusses techniques for detecting OOD and adversarial samples which do not require modifying model architecture or performing additional fine-tuning [17, 27, 25, 29]. Hendrycks et al. propose a baseline approach that uses the maximum value of the posterior distribution from DNNs. Liang et al. build on Hendrycks et al.’s work by pre-processing the input and its corresponding output, but their method requires an additional forward and backward pass at inference time. Ma et al. propose using the local intrinsic dimensionality (LID) property of inputs to distinguish adversarial samples, but this adds overhead for computing the LID of each inner layer during inference. Recent work by Lee et al. distinguishes legitimate samples from OOD and adversarial samples by measuring the Mahalanobis distance of each input. However, this requires calculating the statistical characteristics (i.e., variance and mean) of each inner layer and multiple forward passes for each input at inference time. For brevity, this technique is henceforth referred to as Mahalanobis throughout this paper. To the best of our knowledge, Mahalanobis is currently the only unified approach which can detect both OOD and adversarial samples.
|DUQ ||RCE ||ODIN ||LID ||Mahalanobis ||Ours|
|No architecture modification||✓||✓||✓||✓|
|Single pass during inference||✓||✓||✓|
|Inference Latency (s)||-||-||3.10||8.49||2.83||0.85|
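Table 1: Comparison with other detection methods. Inference latency is the total time for running inference (batch size of 8) 100 times with ResNet-34 on CIFAR-10, detailed in Sec. 4.3.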
3.1 Out-of-Distribution and Adversarial Samples
Adversarial samples are intentionally generated to induce misclassification by a target model [6, 30, 13]. Adversarial samples are crafted by adding visually indistinguishable perturbations to legitimate samples, resulting in a significant change in the predictive output of the target model. This procedure can be formulated as:
$$x' = \arg\max_{x'} \mathcal{L}(f(x'), y) \quad \text{s.t.} \quad \|x' - x\|_p \le \epsilon,$$
where $f$ and $x'$ represent a target DNN and an adversarial sample, respectively, and $x$ is the legitimate sample with label $y$. $\mathcal{L}$ is used to evaluate the attack efficacy, as intended by the attacker [6, 14]. $\epsilon$ constrains the $\ell_p$-norm distance between $x$ and $x'$, and is typically a small value.
3.2 Distribution Distances
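Consider two sets of samples, $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_j\}_{j=1}^{m}$, drawn from two different distributions $P$ and $Q$, respectively. Integral probability metrics (IPMs) employ a class of witness functions $\mathcal{F}$ to determine how close $P$ and $Q$ are to each other, choosing the function that results in the largest discrepancy in expectation over $P$ and $Q$:
$$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right|,$$
$$\hat{d}_{\mathcal{F}}(X, Y) = \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \frac{1}{m} \sum_{j=1}^{m} f(y_j) \right|,$$
where the second expression is derived from empirical estimates of the expectations in the first. $d_{\mathcal{F}}$ satisfies three axioms: non-negativity, symmetry, and the triangle inequality. It is a pseudo-metric: $d_{\mathcal{F}}(P, P) = 0$, but $d_{\mathcal{F}}(P, Q) = 0$ does not necessarily imply $P = Q$.
IPMs are a general framework, so various probability metrics can be defined by choosing an appropriate class of witness functions $\mathcal{F}$. For example, the Wasserstein-1 metric is defined using 1-Lipschitz functions ($\mathcal{F} = \{f : \|f\|_{L} \le 1\}$). By using $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$, we obtain maximum mean discrepancy (MMD), where $\mathcal{H}$ represents a reproducing kernel Hilbert space with $k$ as its reproducing kernel.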
3.3 Detection with Maximum Mean Discrepancy
A widely used kernel in MMD is the Gaussian kernel: $k(x, y) = \exp\left(-\|x - y\|^2 / (2\sigma^2)\right)$. The corresponding implicit feature map has an infinite number of output dimensions, as can be seen from its Taylor expansion. Previous work shows that MMD is able to detect adversarial samples using Fisher’s permutation test. To do this, we initially compute $\mathrm{MMD}(X, Y)$, where $X$ is the batch to be detected and $Y$ is a random legitimate batch used as reference. Then, we mutually exchange samples from $X$ and $Y$ to create two new sets, $X'$ and $Y'$, and recompute $\mathrm{MMD}(X', Y')$. This exchange and recomputation process is repeated 1000 times. If the number of exchanged MMDs larger than $\mathrm{MMD}(X, Y)$ exceeds a pre-defined threshold, $X$ is determined to be a batch consisting entirely of adversarial samples. For simplicity, the sizes of $X$ and $Y$ are the same and referred to as the batch size. Preliminary results in previous work show that MMD can be used to find adversarial batches when the batch size is sufficiently large (e.g., 50 for an FGSM attack on MNIST). However, this method fails against stronger attacks (e.g., the CW attack) or on more complex datasets (e.g., CIFAR-10).
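The MMD-based procedure described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the cited work: the function names, the kernel bandwidth `sigma`, the number of permutations, and the use of the biased MMD estimator are all assumptions. The sketch returns the conventional permutation p-value (the fraction of exchanged splits whose MMD is at least the observed one); how that fraction is thresholded is left to the caller.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def permutation_p_value(xs, ys, n_perm=200, sigma=1.0, seed=0):
    """Fraction of random re-splits whose MMD^2 is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, sigma)
    pooled = list(xs) + list(ys)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # exchange samples between the two sets
        if mmd2(pooled[:len(xs)], pooled[len(xs):], sigma) >= observed:
            count += 1
    return count / n_perm
```

A small p-value indicates that the tested batch and the reference batch are unlikely to come from the same distribution. The cost grows quadratically with batch size and linearly with the number of permutations, which is part of what makes this test expensive.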
The failure of MMD with a Gaussian kernel is due to its limitations: 1) it has insufficient capacity to handle high dimensional data with complex distributions and 2) it cannot effectively model semantic information for images .
To address the limitations of MMD in detecting abnormal samples, we describe an improved IPM that leverages well-trained neural networks in the following section. The proposed metric tremendously reduces required batch size and obviates the need for the costly Fisher’s permutation process.
3.4 Detection with Channel Mean Discrepancy
We first choose a class of functions with the form for use with IPMs. Here, is a scalar vector with bounded norm , and is a neural network parameterized by . Given , we see is a finite dimensional Hilbert space. The metric indexed by is given by:
In Eq. (missing), the maximum is taken over the mapping function, which is a neural network. However, there is no need to optimize the network to maximize Eq. (missing): we simply need a suitable mapping function capable of differentiating samples from the two distributions. We see that a pre-trained classification model meets this criterion. As illustrated in Fig. 2, while adversarial and legitimate samples are visually indistinguishable, the features extracted by the classification model differ significantly. We also show in Fig. 1 that a random network with the same architecture cannot reveal feature space differences between adversarial and legitimate samples. Based on analysis of the pre-trained classification model, we define a novel metric
where each term is the mean of the feature embeddings that the pre-trained classification model extracts from samples of the corresponding distribution. In this paper, we consider the classification model to be a convolutional neural network and so only compute the mean of feature embeddings in a channel-wise manner. We call the metric defined in Eq. (missing) the Channel Mean Discrepancy (CMD).
Given a legitimate distribution, the second term in Eq. (missing) is ideally the mean of the feature embeddings across the entire training set. However, we find that the channel-wise moving average stored in the widely used Batch Normalization (BN) layer serves as a reasonable approximation of this mean. A BN layer normalizes the feature embeddings to alleviate covariate shift. During the training phase, a BN layer keeps running estimates of the channel-wise feature mean (and variance) computed on the training batches, which are then used for normalization during inference. Given an input batch, the CMDs are computed by: 1) passing the batch to the classification model, 2) computing the batch mean along each channel of features, and 3) comparing the batch means with the moving averages from the BN layers.
Since adversarial samples are crafted to significantly impact the prediction of classification models, we can expect the CMD between an adversarial batch and the legitimate distribution to be significantly larger than the CMD between a legitimate batch and the legitimate distribution. In practice, a convolutional neural network usually has several convolutional layers, and can thus extract multi-level features from which the corresponding CMDs are computed. We sum all available CMDs, computed across various channels and layers, to make the difference between adversarial and legitimate batches more pronounced.
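The three computation steps and the summation over channels and layers can be sketched as follows. This is a minimal sketch under stated assumptions: features are given as nested lists indexed as `features[sample][channel][row][col]`, the per-channel discrepancy is taken to be an absolute difference, and the features are assumed to be the inputs of each BN layer (so that they are comparable with that layer's stored running mean). The function names are hypothetical.

```python
def channel_means(features):
    """Mean of each channel over all samples and spatial positions.

    features[n][c][h][w] is a nested list: sample, channel, row, column.
    """
    n_channels = len(features[0])
    means = []
    for c in range(n_channels):
        values = [v for sample in features for row in sample[c] for v in row]
        means.append(sum(values) / len(values))
    return means

def channel_mean_discrepancy(features, running_mean):
    """Per-channel absolute gap between batch means and stored running means."""
    return [abs(m - r) for m, r in zip(channel_means(features), running_mean)]

def summed_cmd(layer_features, layer_running_means):
    """Arithmetic sum of CMDs over all channels of all layers."""
    return sum(
        sum(channel_mean_discrepancy(feats, running))
        for feats, running in zip(layer_features, layer_running_means)
    )
```

A batch would then be flagged as abnormal when `summed_cmd` exceeds a threshold chosen on held-out legitimate batches; the thresholding step is not shown here.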
Fig. 1 (b) shows the effectiveness of summed CMDs for distinguishing adversarial batches from legitimate batches. The adversarial samples are crafted via the CW attack with a batch size of 8. In contrast, MMD fails to differentiate adversarial and legitimate batches under the same setting, as shown in Fig. 1 (a). For detection with MMD, we use the procedure and hyper-parameters detailed in Sec. 3.3 and [5, 14]. To further validate the necessity of using the pre-trained classification model as the mapping function in Eq. (missing), we replace the pre-trained classification model with a randomly weighted counterpart. As shown in Fig. 1 (c), summed CMDs for a randomly weighted model cannot effectively differentiate adversarial and legitimate batches. More details on the experiments in Fig. 1 can be found in the Appendix.
3.5 Effect of Batch Size
Consider a specific channel in the feature embeddings whose mean is computed over all elements from different spatial positions in the channel and different samples in a batch. Each element responds to a corresponding region in the raw image, known as the receptive field, as illustrated in Fig. 3. Receptive field size increases linearly with layer depth in the CNN. As a result, computing the channel mean by averaging elements across spatial positions can be thought of as averaging over several sub-images. In addition, multiple channels from one feature embedding and multi-level feature embeddings from layers at different depths in the model can be further utilized to increase the versatility of CMDs. This implicitly augments the input batch, thus enabling CMDs to remain effective with extremely small batch sizes compared to previous MMD-based methods.
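The growth of the receptive field with depth can be made concrete with the standard recurrence for stacked convolution or pooling layers. The sketch below is illustrative only; the function name and the `(kernel_size, stride)` layer description are assumptions, and padding or dilation are ignored.

```python
def receptive_fields(layers):
    """Receptive field size after each layer, given (kernel_size, stride) pairs."""
    size = 1  # one input pixel sees itself
    jump = 1  # distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes
```

For example, four stacked 3x3 convolutions with stride 1 give receptive fields of 3, 5, 7, and 9 pixels, which is the linear growth mentioned above. With strides larger than 1, the growth is faster than linear.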
Given a batch of input samples, we use the sum of all available CMDs as a metric of how far samples in the batch deviate from the legitimate training distribution.
As a result, the minimum batch size required by CMD is much smaller than with MMD. Previous work reports that the minimum batch size needed to differentiate adversarial and legitimate batches is 50 for an FGSM attack on MNIST, and that MMD fails to detect attacks on CIFAR-10 even when using a batch size of 10,000. In contrast, our summed CMD is able to differentiate adversarial and legitimate batches with batch sizes as small as 4 for a CW attack on CIFAR-10. We highlight two aspects of our approach which contribute to the smaller achievable batch size: 1) CMD uses a well-trained CNN as a much more powerful mapping function to model semantic information by considering pixel correlations, and 2) multi-level and multi-channel CMDs are equivalent to augmenting the input batch with sub-images at different scales, as illustrated in Fig. 3.
Although CMD significantly reduces the required batch size, the ultimate goal is to achieve single image detection (i.e., a batch size of 1). Using summed CMDs, the detection AUROC drops from 99.31% to 82.11% as the batch size decreases from 4 to 1. In the following section, we further improve summed CMDs with a lightweight detection model to achieve single image detection.
3.6 Sensitivity-Aware CMD
Instead of naively treating all CMDs equally, we learn a coefficient for each CMD before summing. Since detection of abnormal samples is a binary classification task, learning these coefficients translates to fitting a simple logistic regression (LR) detection model. The input to the LR model is a vector $c$ consisting of all available CMDs for the classification model and a given sample. The LR model performs a dot product on the input with the learned coefficient vector $w$ and outputs the prediction score after applying the sigmoid function,
$$s = \sigma(w^\top c) = \frac{1}{1 + \exp(-w^\top c)}.$$
The LR model is lightweight and can be trained efficiently by several optimization methods. In addition, training (and inference) of the LR model requires no raw samples, only vectors of CMDs, which are cheap to obtain through a single forward pass of the classification model on the raw samples.
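As an illustration of how such a detector could be fitted, the sketch below trains a logistic regression on CMD vectors with plain batch gradient descent. It is a sketch under assumptions, not the training procedure used in the experiments: the function names, the learning rate, the number of epochs, and the added bias term are all hypothetical choices. Labels are 1 for abnormal samples and 0 for legitimate ones.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, cmd_vector):
    """Detection score in (0, 1); higher means more likely abnormal."""
    return sigmoid(sum(w * c for w, c in zip(weights, cmd_vector)) + bias)

def fit_logistic_regression(cmd_vectors, labels, lr=1.0, epochs=1000):
    """Fit one coefficient per CMD by batch gradient descent on the log loss."""
    dim = len(cmd_vectors[0])
    n = len(cmd_vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(cmd_vectors, labels):
            err = predict(weights, bias, x) - y  # derivative of log loss w.r.t. logit
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias
```

Channels whose CMDs separate legitimate from abnormal samples receive large coefficients, while uninformative channels stay close to zero, which is the sensitivity weighting described above.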
|OOD||Baseline ||ODIN ||Mahalanobis ||Ours|
|TNR at TPR 95% / AUROC / Detection acc.|
|SVHN||SVHN||32.5 / 89.9 / 85.1||86.6 / 96.7 / 91.1||96.4 / 99.1 / 95.8||100 / 99.9 / 99.5|
|LSUN||45.4 / 91.0 / 85.3||13.8 / 81.1 / 55.5||31.2 / 86.8 / 79.6||85.8 / 97.1 / 91.2|
|TinyImageNet||44.7 / 91.0 / 85.1||16.3 / 82.9 / 77.9||40.3 / 88.4 / 81.0||83.3 / 96.7 / 90.1|
|LSUN||LSUN||45.4 / 91.0 / 85.3||73.8 / 94.1 / 86.7||98.9 / 99.7 / 97.7||100 / 99.9 / 99.8|
|SVHN||32.5 / 89.9 / 85.1||35.5 / 83.6 / 74.6||80.4 / 96.8 / 91.8||92.2 / 98.0 / 93.7|
|TinyImageNet||44.7 / 91.0 / 85.1||72.5 / 94.0 / 86.5||96.9 / 99.5 / 96.2||99.9 / 99.9 / 98.9|
|SVHN||SVHN||20.3 / 79.5 / 73.2||62.7 / 93.9 / 88.0||91.9 / 98.4 / 93.7||100 / 99.9 / 99.4|
|LSUN||18.8 / 75.8 / 69.9||2.7 / 51.7 / 52.1||1.8 / 44.0 / 50.0||53.6 / 90.6 / 83.4|
|TinyImageNet||20.4 / 77.2 / 70.8||4.1 / 57.0 / 55.9||7.1 / 58.9 / 58.7||55.5 / 91.2 / 83.5|
|LSUN||LSUN||18.8 / 75.8 / 69.9||45.6 / 85.6 / 78.3||90.9 / 98.2 / 93.5||100 / 99.9 / 99.8|
|SVHN||20.3 / 79.5 / 73.2||16.5 / 70.7 / 65.6||55.1 / 92.4 / 85.4||79.8 / 94.2 / 88.8|
|TinyImageNet||20.4 / 77.2 / 70.8||48.5 / 87.8 / 80.3||89.9 / 98.0 / 92.9||99.9 / 99.9 / 98.7|
|SVHN||SVHN||40.2 / 89.9 / 83.2||86.2 / 95.5 / 91.4||90.8 / 98.1 / 93.9||99.9 / 99.9 / 99.3|
|LSUN||66.4 / 95.4 / 90.3||78.1 / 94.6 / 91.2||42.9 / 84.9 / 80.1||91.6 / 98.2 / 93.4|
|TinyImageNet||58.8 / 94.1 / 88.5||68.2 / 91.2 / 88.4||43.3 / 77.7 / 75.9||82.5 / 96.8 / 90.9|
|LSUN||LSUN||66.6 / 95.4 / 90.3||96.2 / 99.2 / 95.7||97.2 / 99.3 / 96.3||99.9 / 99.9 / 99.6|
|SVHN||40.0 / 89.9 / 83.2||37.5 / 87.3 / 79.6||12.2 / 45.9 / 54.6||86.1 / 96.4 / 91.1|
|TinyImageNet||58.8 / 94.1 / 88.5||91.8 / 98.4 / 93.8||92.2 / 97.1 / 93.6||99.9 / 99.9 / 98.6|
|SVHN||SVHN||26.7 / 82.7 / 75.6||70.6 / 93.8 / 86.6||82.5 / 97.2 / 91.5||99.9 / 99.9 / 99.1|
|LSUN||16.7 / 70.8 / 64.9||6.0 / 67.2 / 64.0||47.1 / 82.1 / 77.4||70.4 / 94.7 / 87.8|
|TinyImageNet||17.5 / 71.6 / 65.7||7.5 / 69.1 / 65.0||52.1 / 80.6 / 77.1||57.2 / 92.6 / 85.4|
|LSUN||LSUN||16.7 / 70.8 / 64.9||41.2 / 85.5 / 77.1||91.4 / 98.0 / 93.9||100 / 99.9 / 99.7|
|SVHN||26.7 / 82.7 / 75.6||26.1 / 85.2 / 78.7||64.0 / 90.3 / 84.2||78.2 / 93.3 / 88.2|
|TinyImageNet||17.5 / 71.6 / 65.7||41.1 / 84.8 / 76.6||83.0 / 94.1 / 89.6||99.8 / 99.6 / 98.4|
We evaluate our technique using multiple classification models, including ResNet-34 and DenseNet-100 trained on the CIFAR-10, CIFAR-100, and SVHN datasets. To validate the generality of our method and for fairness, we use pre-trained weights directly downloaded from previous works (github.com/pokaxpoka/deep_Mahalanobis_detector/) or open-source implementations (github.com/osmr/imgclsmob) for all classification models. Consistent with the configurations from [25, 29], we adopt the following evaluation metrics: the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), and detection accuracy. We compare against state-of-the-art methods which do not require modifying model architectures or fine-tuning classification models because they require a similar magnitude of training effort to our own.
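For reference, the three metrics can be computed from detection scores as sketched below. The function names are hypothetical, and two conventions are assumptions rather than details taken from the evaluation code: a sample is predicted positive when its score is at least the threshold, and detection accuracy is taken to be the best average of TPR and TNR over all thresholds (i.e., assuming equal class priors). Which class counts as positive is left to the caller.

```python
import math

def _split(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos, neg = _split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tnr_at_tpr(scores, labels, tpr_target=0.95):
    """TNR at the highest threshold whose TPR reaches tpr_target."""
    pos, neg = _split(scores, labels)
    pos.sort(reverse=True)
    # Number of positives that must be accepted; the epsilon guards against float error.
    k = max(1, math.ceil(tpr_target * len(pos) - 1e-9))
    threshold = pos[k - 1]
    return sum(n < threshold for n in neg) / len(neg)

def detection_accuracy(scores, labels):
    """Best value of 0.5 * (TPR + TNR) over all thresholds."""
    pos, neg = _split(scores, labels)
    best = 0.0
    for t in sorted(set(scores)) + [float("inf")]:
        tpr = sum(p >= t for p in pos) / len(pos)
        tnr = sum(n < t for n in neg) / len(neg)
        best = max(best, 0.5 * (tpr + tnr))
    return best
```

The pairwise AUROC computation is quadratic in the number of samples, which is acceptable for small evaluations; a rank-based formulation would be preferable for large test sets.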
4.1 Detecting Out-of-distribution Samples
We evaluate the discriminatory power of CMD on four OOD datasets: CIFAR-10, SVHN, TinyImageNet, and LSUN. The TinyImageNet dataset is a subset of ImageNet, and the LSUN dataset contains scene images from 10 different categories. To reconcile image size differences with CIFAR-10 and SVHN, images in TinyImageNet and LSUN are resized to 32×32 pixels. We refer readers to previous work for more details. In each experiment, the training dataset used for the classification model is considered the in-distribution dataset.
For our method, we set the batch size to 1 and use sensitivity-aware CMD, as detailed in Sec. 3.6, to detect OOD samples. To train the logistic regression model for sensitivity-aware CMD, we randomly select 2000 samples from the in-distribution dataset and another 2000 samples from the OOD dataset, following the same settings as in previous work. We generate corresponding CMD vectors for each of the 4000 images for logistic regression training. Detection performance of the logistic regression is tested on a mixture of the in-distribution testing set and the OOD testing set. In real-world applications, a classification model does not know the type of OOD samples in advance. Thus, we are interested in how a logistic regression model, trained on one type of OOD dataset, performs on another OOD dataset. As shown in Table 2, sensitivity-aware CMD consistently outperforms the compared methods on various datasets and classification models by a large margin, and exhibits higher transferability across OOD datasets.
4.2 Detecting Adversarial Samples
We evaluate our approach using four attack methods: FGSM, BIM, DeepFool, and CW. Consistent with previous work [27, 25], the datasets used in training and inference consist of three kinds of data: 1) legitimate samples from the original training and testing sets, 2) adversarial samples crafted from legitimate samples using the specific attack methods, and 3) noisy samples crafted by adding random noise to legitimate samples. When crafting adversarial and noisy samples, we constrain the scale of adversarial and noisy perturbations as described in previous work. Note that all adversarial samples yield different prediction results when compared to their corresponding legitimate counterparts, but the noisy samples yield identical (correct) results to their counterparts. A robust detection method should be able to classify noisy samples as normal ones.
Similar to previous work, we randomly select 2000 legitimate samples from the original training dataset and their corresponding adversarial and noisy samples to train the logistic regression model. A mixture of legitimate, adversarial, and noisy testing samples is used to evaluate the methods.
We also evaluate transferability of the logistic regression model. Specifically, a logistic regression trained on one type of attack is also evaluated on the other attack types. We find that the logistic regression model trained on the CW attack has the best transferability. Mahalanobis transfers best to other attacks when trained using the FGSM attack. Thus, the ‘transfer’ results reported in Table 3 are for Mahalanobis trained on FGSM, and our method trained on CW.
We compare our method with LID and Mahalanobis. Similar to above, hyper-parameters for LID and Mahalanobis are carefully tuned via grid search, and their results are reported using optimal hyper-parameters. As shown in Table 3, our method consistently outperforms LID and Mahalanobis in both ‘no transfer’ and ‘transfer’ settings.
4.3 Inference Cost
Our proposed method is lightweight and efficient: for an incoming sample, only a single forward pass on the pre-trained classification model is required to compute its CMD vector. The extra overhead introduced by our detection method is negligible during inference. We compare our method with other approaches in terms of inference latency. Specifically, we test each method with eight samples (i.e., a batch size of 8) and repeat the inference procedure 100 times. Testing hardware included an Intel Xeon E5-2650 v4 CPU, an NVIDIA 1080 Ti GPU, and 64 GB of RAM. The total times for 100 runs are reported in Table 1. Our method has the lowest inference cost of all compared methods, because it requires only a single forward pass to compute CMD vectors. In contrast, the other approaches (i.e., ODIN, Mahalanobis, and LID) typically require either: 1) at least one additional forward and backward pass for adding noise to inputs or 2) searching for several nearest neighbours by measuring the activation values of multiple inner layers.
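The latency protocol above (a fixed batch, 100 repetitions, total wall-clock time) can be sketched with the standard library. The function name and the `run_inference` callable are hypothetical; when timing GPU code, the callable would also need to wait for the device to finish (for example, by synchronizing) before returning, which is not shown here.

```python
import time

def total_latency(run_inference, batch, repeats=100):
    """Total wall-clock seconds for running inference on one batch `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_inference(batch)
    return time.perf_counter() - start
```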
|TNR at TPR 95% / AUROC / Detection acc.|
|LSUN||LSUN||100 / 99.9 / 99.8||100 / 99.8 / 99.3|
|SVHN||92.2 / 98.0 / 93.7||99.9 / 99.9 / 99.3|
|TinyImageNet||99.9 / 99.9 / 98.9||100 / 99.8 / 99.3|
4.4 Applying CMD on Models without BN Layers
CMD retrieves the running average statistics stored in the BN layers to approximate the statistics of feature embeddings across the entire legitimate training set (the second term of the CMD definition). If a classification model does not have BN layers, we can manually estimate these statistics by traversing the entire training set with the classification model and computing the running average statistics for each channel.
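One way to estimate these statistics in a single pass over the training set is a streaming per-channel mean, sketched below. The class name and the nested-list layout `features[sample][channel][row][col]` are assumptions. Note also that this sketch keeps an exact cumulative mean, whereas BN layers typically keep an exponential moving average, so the two are not numerically identical.

```python
class RunningChannelMean:
    """Streaming per-channel mean over batches of feature embeddings."""

    def __init__(self, n_channels):
        self.mean = [0.0] * n_channels
        self.count = 0  # elements seen so far in each channel

    def update(self, features):
        """Fold one batch, indexed as features[n][c][h][w], into the running means."""
        n_new = sum(len(row) for sample in features for row in sample[0])
        total = self.count + n_new
        for c in range(len(self.mean)):
            batch_sum = sum(v for sample in features for row in sample[c] for v in row)
            # Incremental form of (count * mean + batch_sum) / total.
            self.mean[c] += (batch_sum - n_new * self.mean[c]) / total
        self.count = total
```

One such accumulator per convolutional layer, updated once per training batch, yields the reference means against which CMDs are later computed.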
We use VGG-19 trained on CIFAR-10 as an example. VGG-19 has 16 convolutional layers and 3 fully-connected layers, and no BN layers. We use the aforementioned method to compute the running average statistics before repeating the OOD experiments. As shown in Table 4, detection performance does not degrade at all.
Abnormal samples, both adversarial and out-of-distribution, belong to distributions distinct from the distribution of legitimate samples. Inspired by integral probability metrics, we proposed a novel metric, Channel Mean Discrepancy (CMD), for evaluating how far input samples’ distributions deviate from a legitimate distribution. We showed how CMD could be used to reliably and efficiently identify both out-of-distribution and adversarial samples in a unified way. From a theoretical perspective, we have provided an intuition as to why CMD can be computed by channel-wise comparison of feature means and running means (retrieved from batch normalization layers). Our empirical results validate the effectiveness and generalizability of CMD, and a simple logistic regression model with CMDs can achieve state-of-the-art detection performance. In addition, CMDs can be computed very efficiently. Detecting with CMD is thus as fast as a standard forward pass through the classification model. Further studies in this direction may lead to improved techniques for both adversarial generation and detection.
[1] (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In ICML.
[2] (2019) Computing receptive fields of convolutional neural networks. Distill. https://distill.pub/2019/computing-receptive-fields
[3] Wasserstein generative adversarial networks. In ICML.
[4] (2018) Demystifying MMD GANs. ICLR.
[5] Adversarial examples are not easily detected: bypassing ten detection methods. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security.
[6] (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
[7] (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
[8] (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. NeurIPS.
[9] (2015) On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields.
[10] Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
[11] (2017) Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960.
[12] (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661.
[13] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[14] (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280.
[15] (2015) Learning both weights and connections for efficient neural networks. NeurIPS.
[16] (2016) Deep residual learning for image recognition. pp. 770–778.
[17] (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
[18] (2020) Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960.
[19] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[20] (2020) Two-sample testing using deep learning. In AISTATS.
[21] (2002) Logistic regression. Springer.
[22] (2016) Adversarial examples in the physical world.
[23] (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
[24] (2017) Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325.
[25] (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. NeurIPS.
[26] (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5764–5772.
[27] (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR.
[28] (2020) Learning deep kernels for non-parametric two-sample tests. In ICML.
[29] (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. ICLR.
[30] (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
[31] (1997) Integral probability metrics and their generating classes of functions. Advances in Applied Probability.
[32] (1975) Arguments for Fisher’s permutation test. Annals of Statistics.
[33] (2018) Towards robust detection of adversarial examples. In NeurIPS.
[34] (2020) Uncertainty estimation using a single deep deterministic neural network. In International Conference on Machine Learning, pp. 9690–9700.
Appendix A Robustness Against Adaptive Attacks
To further investigate the robustness of our method, we test our approach under the scenario where both our model and defense schema are known to the attacker. The attacker can therefore perform an adaptive attack (i.e., an attack specifically designed against a given defense) to bypass our detector while deceiving the classification model through Algo. 2.
As seen in Algo. 2, we propose a novel objective function to perform the adaptive attack. It combines two terms: the cross-entropy loss of the classification model, and a regularization term, weighted by a coefficient, on the summed CMD of the input batch, where the CMD vector is computed as described in Algo. 1. The second term is a novel regularization. The detection schema in this paper is based on the fact that a legitimate input batch has smaller Channel Mean Discrepancies than its adversarial (or OOD) counterpart. To bypass our detector, a straightforward method for attackers is to minimize the summed CMD when crafting the adversarial samples. By updating the input batch with a signed gradient (line 5 in Algo. 2) computed from this objective, the optimized adversarial batch is expected to both deceive the classification model (implied by the first term) and possess a small summed CMD value (implied by the second term) simultaneously.
We set the maximum number of optimization iterations to 1,000 to ensure the efficacy of the adaptive attack. The magnitude of the adversarial perturbation and the step size are set to 0.25 and 0.01, respectively, following previous work. We vary the coefficient of the CMD regularization to balance the two loss terms of the objective. When the coefficient is 0, the proposed adaptive attack reduces to the standard basic iterative method (BIM), but with many more optimization iterations.
Notably, there may be other ways to construct an adaptive attack. For example, an adaptive attack could be built upon the CW attack. However, the CW attack introduces an additional coefficient during the optimization procedure, and it is notoriously difficult to search for two proper coefficients with which to conduct an adaptive attack. Moreover, standard BIM achieves an attack success rate comparable to that of CW when the perturbation magnitude is set to 0.25. Therefore, we only consider building an adaptive attack based on the BIM approach here. We leave stronger adaptive attacks inspired by our proposed detection schema to future work.
We test the robustness of our approach against the adaptive attack as the coefficient increases, using ResNet-34 on the CIFAR-10 test set (containing 10,000 legitimate samples). We gradually increase the coefficient from 0 and, out of the 10,000 adversarial samples crafted using Algo. 2, calculate the percentage that both deceive the classification model and bypass the detector. The result is shown in Fig. 4. For standard BIM (coefficient 0), almost all adversarial samples deceive the classification model, but most of them are caught by our detector, so the overall success rate is about 0%. Increasing the coefficient allows more adversarial samples to bypass the detector but lowers the rate at which they deceive the classification model. Consequently, when the coefficient is very large, the overall success rate is again quite small. This trend is clearer in Fig. 5, where the success rate of deceiving the classification model drops quickly as the coefficient increases.
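The overall success rate used above counts only samples that succeed on both fronts. A minimal sketch, assuming two hypothetical per-sample boolean arrays (`fooled_classifier` and `bypassed_detector`, names chosen here for illustration):

```python
import numpy as np

def overall_success_rate(fooled_classifier, bypassed_detector):
    """Fraction of adversarial samples that both deceive the classifier
    and are not flagged by the detector."""
    fooled = np.asarray(fooled_classifier, dtype=bool)
    bypassed = np.asarray(bypassed_detector, dtype=bool)
    return float(np.mean(fooled & bypassed))

# Three of four samples fool the classifier, two evade the detector,
# but only one does both.
rate = overall_success_rate([1, 1, 1, 0], [0, 1, 0, 1])
```

Because the two conditions are combined with a logical AND, a high rate on either one alone does not imply a high overall rate.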
Note that, for the above experiment, we directly use the detector trained on standard BIM adversarial samples from Sec. 4.2. In practice, a defender can further fine-tune the detector to better defend against adaptive attacks with little training effort.
Preliminary results show that our approach has significant potential for providing robustness for DNN classification models, even against adaptive attacks. Interestingly, the coefficient value most favorable to the attacker yields an overall success rate of only 16.76%; a smaller or a larger coefficient leads to a lower overall success rate. Our empirical results indicate that there is a trade-off between deceiving a classification model and bypassing our proposed detection. It is difficult for attackers to achieve both simultaneously.
Notably, both determining a proper coefficient value and using a large number of optimization iterations for such an adaptive attack are very costly for an attacker, further disincentivizing attacks.
Appendix B Derivation of Eq. (missing)
We start from the general definition of IPMs. We then consider a specific class of functions and substitute it into that definition. The last equality of the resulting chain is obtained using the Cauchy–Schwarz inequality (given norm bounds in which the bounding constant is finite).
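For reference, the standard IPM definition and one common way a Cauchy–Schwarz argument closes such a derivation can be written as below. This is a sketch under stated assumptions rather than a reproduction of the paper's equations: the symbols $P$, $Q$, $\mathcal{F}$, $v$, $\phi$, and $C$ are chosen here for illustration, and the function class is assumed to consist of unit-norm linear functionals of a bounded map $\phi$.

```latex
% General IPM between distributions P and Q over a function class F
\gamma_{\mathcal{F}}(P, Q)
  = \sup_{f \in \mathcal{F}}
    \left| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right|

% Assumed class: F = { x -> <v, phi(x)> : ||v||_2 <= 1 }, with ||phi(x)||_2 <= C < infinity
\gamma_{\mathcal{F}}(P, Q)
  = \sup_{\|v\|_2 \le 1}
    \left| \left\langle v,\;
      \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)]
    \right\rangle \right|
  = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \right\|_2
```

Under these assumptions, the Cauchy–Schwarz inequality bounds the inner product by the norm of the mean difference, and the bound is attained when $v$ points along that difference; the finite constant $C$ guarantees that both expectations exist.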
The boundedness of the function is easy to guarantee. In our case, the function is a neural network, and its input (i.e., images) has a bounded norm. Boundedness is thus ensured as long as the weights and biases of the network have finite norms.
Appendix C Details of Experiments in Fig. 1
We first use the CW attack to craft adversarial images from the CIFAR-10 test set, following the configurations of previous work. We then differentiate between legitimate and adversarial batches using three methods: MMD with a pre-trained model, CMD with a pre-trained model, and CMD with a randomly initialized model. To compare the three methods fairly, we use exactly the same batches.
CMD using a pre-trained model
Given a legitimate batch and its corresponding adversarial batch, we use Algo. 1 to compute their CMD vectors. We sum over all elements in each CMD vector to get the summed CMD (a scalar). The quantity plotted in Fig. 1 is the normalized summed CMD: each batch's summed CMD is divided by the largest summed CMD across all batches.
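The summation and normalization steps can be sketched as follows, assuming the CMD vectors from Algo. 1 are already available as rows of an array (the name `cmd_vectors` is hypothetical):

```python
import numpy as np

def normalized_summed_cmd(cmd_vectors):
    """Sum each batch's CMD vector to a scalar, then divide by the
    largest summed CMD across all batches."""
    summed = np.asarray(cmd_vectors, dtype=float).sum(axis=1)
    return summed / summed.max()

# Three batches with two channels each: sums are 3, 8, 4.
values = normalized_summed_cmd([[1, 2], [3, 5], [0, 4]])
```

After normalization the largest value is 1, so legitimate and adversarial batches can be compared on a common scale.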
CMD using a randomly initialized model
We repeat the above experiment but with a randomly initialized model. The model has the same architecture as the pre-trained model, but its weights and biases are randomly initialized using a Gaussian distribution. To obtain running average statistics in the BN layers of the randomly initialized model, we run the model on the original training set for 20 epochs. This updates the BN layers’ running average statistics but does not update the weights and biases.
MMD using a pre-trained model
MMD is computed on a pair of batches, so we introduce a third batch, consisting of legitimate samples, as a reference batch. Given an incoming batch (legitimate or adversarial), its MMD is computed by comparing it to a random reference batch. To make MMD more robust, we do not directly use the MMD between the incoming and reference batches. Instead, we adopt Fisher’s permutation test as used by [14, 5]. Given an incoming batch and a reference batch, we first compute the MMD between them. Then, we randomly exchange samples between the two batches to obtain two new batches and compute the MMD between those. Comparing the two MMD values indicates whether the incoming and reference batches are likely from different distributions, that is, whether the incoming batch is likely adversarial. The exchange and recomputation process is repeated 1,000 times, and the p-value is the fraction of repetitions in which the comparison indicates different distributions. Fig. 1 reports this p-value. For incoming legitimate and adversarial batches, we expect the p-value to be close to 0 and 1, respectively. However, as shown in Fig. 1, MMD does not differentiate between legitimate and adversarial batches.
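A permutation test of this kind can be sketched in NumPy as below. Several choices are assumptions made for illustration and may differ from the cited implementations: the MMD uses a Gaussian kernel with a fixed `bandwidth`, the estimator is the biased one, and the reported fraction counts repetitions in which the original MMD exceeds the MMD of the shuffled batches.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian (RBF) kernel."""
    def kernel(a, b):
        sq = (np.sum(a**2, axis=1)[:, None]
              + np.sum(b**2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def permutation_fraction(incoming, reference, n_perm=1000, seed=0):
    """Fraction of random re-splits whose MMD is below the original MMD
    (assumed comparison direction)."""
    rng = np.random.default_rng(seed)
    a = rbf_mmd2(incoming, reference)
    pooled = np.concatenate([incoming, reference], axis=0)
    n = len(incoming)
    hits = 0
    for _ in range(n_perm):
        # Randomly exchange samples between the two batches.
        idx = rng.permutation(len(pooled))
        b = rbf_mmd2(pooled[idx[:n]], pooled[idx[n:]])
        hits += int(a > b)
    return hits / n_perm
```

With this convention, a batch drawn from a clearly shifted distribution yields a fraction near 1, while two batches from the same distribution give no systematic ordering between the original and shuffled MMD values.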
Appendix D Training Logistic Regression Model
Logistic regression is used to tune channel sensitivities to further improve the discriminatory ability of CMDs, as detailed in Sec. 3.6. Logistic regression can be trained efficiently using several optimization methods. We use multiple optimization methods, including L-BFGS and SAG as implemented in sklearn and SGD as implemented in PyTorch. All of these optimization methods gave similar results.
Appendix E Evaluation Metrics
To evaluate our experiments, we adopt three metrics:
True negative rate at 95% true positive rate (TNR at TPR 95%): This metric is computed as $\mathrm{TNR} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$ when $\mathrm{TPR} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$ is 95%. TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
Area under the receiver operating characteristic curve (AUROC): The ROC curve is a graph of the TPR across different false positive rates. The false positive rate is computed as $\mathrm{FPR} = \mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$, and can be adjusted by varying the threshold setting.
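Both metrics can be computed directly from detector scores. The sketch below assumes that higher scores indicate the positive class and that scores for positive and negative samples are given as separate arrays (`scores_pos`, `scores_neg`); which class is treated as positive is a convention the text does not fix, so it is left to the caller.

```python
import numpy as np

def tnr_at_tpr(scores_pos, scores_neg, tpr=0.95):
    """TNR at the threshold that keeps at least `tpr` of the positives."""
    pos = np.sort(np.asarray(scores_pos, dtype=float))
    neg = np.asarray(scores_neg, dtype=float)
    # At most a (1 - tpr) fraction of positives may fall below the threshold.
    k = int(np.floor((1.0 - tpr) * len(pos)))
    threshold = pos[k]
    return float(np.mean(neg < threshold))

def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative,
    counting ties as one half; this equals the area under the ROC curve."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

The pairwise form of `auroc` is quadratic in the number of samples, which is fine for a sketch; a rank-based computation is preferable for large evaluation sets.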
Appendix F Implementation of Adversarial Attacks
We use four attack techniques from previous works:
Fast Gradient Sign Method (FGSM) 
Basic Iterative Method (BIM)
Carlini-Wagner (CW)
We implement each attack technique in a non-targeted setting [6, 29, 25] and use a distance constraint on the perturbation. Details about the perturbation and classification accuracy across the various attack methods, models, and datasets are included in Table 5.