Lightweight Detection of Out-of-Distribution and Adversarial Samples via Channel Mean Discrepancy

04/23/2021 · by Xin Dong, et al. · Harvard University, The University of Texas at Dallas

Detecting out-of-distribution (OOD) and adversarial samples is essential when deploying classification models in real-world applications. We introduce Channel Mean Discrepancy (CMD), a model-agnostic distance metric for evaluating the statistics of features extracted by classification models, inspired by integral probability metrics. CMD compares the feature statistics of incoming samples against feature statistics estimated from previously seen training samples with minimal overhead. We experimentally demonstrate that CMD magnitude is significantly smaller for legitimate samples than for OOD and adversarial samples. We propose a simple method that uses CMD to reliably differentiate legitimate samples from OOD and adversarial samples, requiring only a single forward pass on a pre-trained classification model per sample. We further demonstrate how to achieve single image detection by using a lightweight model for channel sensitivity tuning, an improvement on other statistical detection methods. Preliminary results show that our simple yet effective method outperforms several state-of-the-art approaches to detecting OOD and adversarial samples across various datasets and attack methods with high efficiency and generalizability.


1 Introduction

Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy on many classification problems, including speech recognition [1] and image classification [16]. However, recent studies have shown DNNs lack the ability to detect abnormal input samples and, as a result, output incorrect predictions for such samples with high confidence. The ability to detect abnormal samples is essential when deploying DNNs in real-world applications.

In this paper, we consider two types of abnormal input: adversarial samples and out-of-distribution samples. Adversarial samples are generated by intentionally perturbing legitimate samples with the goal of inducing errant predictions in the classification model [6]. Adversarial perturbations are usually small-scale and invisible to human observers. Out-of-distribution (OOD) samples are anomalous samples not belonging to the training data distribution [27]. For example, an image of a flower would be considered OOD for a classification model trained on bird images. Significantly, DNNs incorrectly classify OOD samples (from unknown classes) into known classes with high confidence.

Several methods have been developed to address abnormal inputs. In one approach, extra modules are trained and used alongside classification models exclusively for the detection and rejection of abnormal inputs [34]. Other methods add abnormal samples (but with corrected labels) to standard training sets, and fine-tune classification models to obtain the correct results using the augmented training set [33]. However, those methods are computation- and data-intensive in practice. Furthermore, detection methods for OOD and adversarial samples are usually developed separately, increasing the burden of deployment. In this paper, we show that the statistics of feature embeddings in classification models are effective for detecting both adversarial and OOD samples in a lightweight and unified manner.

We observe that abnormal samples, both adversarial and OOD, belong to distributions distinct from the distribution of legitimate samples (i.e., the distribution of the training set). Thus, detecting abnormal samples translates to measuring how far the input samples’ distribution deviates from the legitimate distribution.

Integral probability metrics (IPMs) [31] are a family of probability distance measures that are useful in this context. The idea of IPMs is simple: 1) project two sets of samples from two distributions to a new space using a mapping function, and 2) compare the statistics of the two sets in the projected space. If the statistics are similar, the two sets likely belong to the same distribution. IPMs include various popular distance measures, such as the Wasserstein distance [9] and maximum mean discrepancy (MMD) [4]. Previous studies have shown some preliminary results of adversarial detection using MMD. However, they conclude that MMD is not a suitable metric: it only performs well for detecting simple attacks (i.e., FGSM [13]) on simple datasets (i.e., MNIST [23]) and requires large batch sizes [5, 14].

We improve MMD by replacing its non-parametric mapping function with a neural network: the original pre-trained classification model. The proposed novel metric, Channel Mean Discrepancy (CMD), can be computed in a straightforward and efficient manner. CMDs for a given batch of samples are computed by channel-wise comparison of feature means and running means. Feature means are derived from the channel-wise average of the feature embeddings extracted by the classification model. Running means are retrieved from batch normalization layers and are reasonable estimations of statistics over the legitimate distribution. We empirically show the arithmetic sum of CMDs over all channels is significantly smaller for legitimate batches than for abnormal batches. Compared to previous works, the minimum batch size for detection is reduced from 50 to 4. To achieve single sample detection (i.e., batch size 1), we tune channel coefficients for the CMDs before performing a weighted sum according to each channel’s sensitivity to abnormal samples.

We extensively evaluate our proposed single sample detection method across various models, datasets, adversarial attacks, and OOD datasets. Our method consistently outperforms other methods and demonstrates high generalizability across attacks and OOD datasets.

In summary, the advantages of the novel metric CMD are:

  • CMD is model-agnostic and may be applied to pre-trained classification models without modifying model architecture or performing expensive fine-tuning.

  • For small batch sizes, the arithmetic sum of CMDs is already effective for measuring how far an incoming batch deviates from a legitimate distribution. Single sample detection is achieved by using a weighted sum of CMDs with tuned channel sensitivity coefficients. In addition, channel sensitivities tuned for one type of abnormal data generalize well to other types of data.

  • Computing CMDs for a given batch is trivial, requiring only a single forward pass on the classification model being used.

2 Related Work

Previous work on detecting OOD and adversarial samples can be categorized into two classes. One class requires modifying model architectures or fine-tuning models. The other class requires only a pre-trained model.

Approaches requiring modifying model architectures or fine-tuning models.

To detect OOD samples, Gal et al. [10] propose using an ensemble of sub-networks to distinguish OOD samples by estimating the predictive uncertainty of each sample. Recent work [24, 7, 18] improves the ability to detect OOD samples by fine-tuning models with additional samples crafted using a GAN [12] or a novel loss function.

To detect adversarial samples, recent work [14, 11, 26] proposes reconstructing models or training a secondary DNN from scratch for the express purpose of identifying adversarial samples. The most recent work [33] proposes retraining a DNN model using a reverse cross-entropy loss function and an embedded K-density detector.

Approaches requiring only a pre-trained model

A set of works discusses techniques for detecting OOD and adversarial samples which do not require modifying the model architecture or performing additional fine-tuning [17, 27, 25, 29]. Hendrycks et al. [17] propose a baseline approach using the maximum value of the posterior distribution from the DNN. Liang et al. [27] build on Hendrycks et al.’s work by pre-processing the input and its corresponding output, but this requires an additional forward and backward pass at inference time. Ma et al. [29] propose using the local intrinsic dimensionality (LID) property of inputs to distinguish adversarial samples, but this adds overhead for computing the LID of each inner layer during inference. Recent work [25] by Lee et al. distinguishes legitimate samples from OOD and adversarial samples by measuring the Mahalanobis distance of each input. However, this requires calculating the statistical characteristics (i.e., variance and mean) of each inner layer and multiple forward passes per input at inference time. For brevity, this technique is henceforth referred to as Mahalanobis throughout this paper. To the best of our knowledge, Mahalanobis is currently the only unified approach which can detect both OOD and adversarial samples.

DUQ [34] RCE [33] ODIN [27] LID [29] Mahalanobis [25] Ours
No fine-tuning
No architecture modification
Single pass during inference
No hyper-parameters
Out-of-distribution
Adversarial
Inference Latency (s) - - 3.10 8.49 2.83 0.85
Table 1: Comparison of approaches for detecting adversarial and out-of-distribution samples. “No fine-tuning” and “No architecture modification” indicate that we may directly use a pre-trained classification model without needing to further fine-tune it or modify its architecture. “Single pass during inference” means only one standard forward propagation through the classification model is required to detect the given input data. “No/light training” indicates that the detection module can be obtained with little or no training effort. The reported inference latency is obtained by running abnormal detection for a batch (batch size 8) 100 times with ResNet-34 on CIFAR-10, detailed in Sec. 4.3.

3 Methodology

3.1 Out-of-Distribution and Adversarial Samples

OOD samples

Out-of-distribution samples generally refer to samples that are from a different distribution than the training data [17, 27, 25]. Samples from the SVHN dataset could be treated as OOD samples for a model trained on the CIFAR-10 dataset.

Adversarial samples

Adversarial samples are intentionally generated to induce misclassification by the target model [6, 30, 13]. Adversarial samples are crafted by adding visually indistinguishable perturbations to legitimate samples, resulting in a significant change in the predictive output of the target model. Such a procedure can be formulated as

$\max_{x_{\mathrm{adv}}} \; J\big(f(x_{\mathrm{adv}}), y\big) \quad \text{s.t.} \quad \|x_{\mathrm{adv}} - x\|_p \le \epsilon,$

where $f$ and $x_{\mathrm{adv}}$ represent the target DNN and the adversarial sample, respectively. $J$ is used to evaluate the attack efficacy, as intended by the attacker [6, 14]. $\epsilon$ constrains the $\ell_p$-norm distance between $x$ and $x_{\mathrm{adv}}$, and is typically a small value.
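To make this formulation concrete, the sketch below crafts adversarial samples with a single FGSM-style step [13] under an $\ell_\infty$ budget; this is a minimal illustrative sketch (the model, labels, and the default eps value are placeholders), not the exact attack configurations evaluated in Sec. 4.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Single-step L-inf attack: increase J(f(x_adv), y) while keeping ||x_adv - x||_inf <= eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)          # the attacker's objective J
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()      # one step of magnitude eps per pixel
        x_adv = x_adv.clamp(0.0, 1.0)                # keep a valid image range
    return x_adv.detach()
```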

3.2 Distribution Distances

Figure 1: Comparison of MMD against CMD when using a pre-trained model and a randomly initialized model as the mapping function. All three methods are tested to discriminate the same legitimate and adversarial batches (batch size 8) on CIFAR-10. The pre-trained model and the random model have the same architecture, ResNet-34. (a) MMD: we conduct the two-sample hypothesis testing with MMD and Fisher’s permutation following [5, 14]. The y-axis is the predictive probability of being an adversarial batch. (b) CMD (Pre-trained Model): the novel distance metric defined in Eq. (6), where the mapping function is instantiated as the pre-trained classification model. The y-axis is the normalized summed CMD. (c) CMD (Random Model): we validate the necessity of the pre-trained model in (b) by replacing it with the same model but with random weights.

Consider two sets of samples, $X = \{x_i\}_{i=1}^{m}$ and $Y = \{y_j\}_{j=1}^{n}$, drawn from two different distributions $P$ and $Q$, respectively. Integral probability metrics (IPMs) [31] employ a class of witness functions $\mathcal{F}$ that determine how close $P$ and $Q$ are to each other, and choose the function resulting in the largest discrepancy in expectation over $P$ and $Q$:

$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \big|$   (1)
$\hat{d}_{\mathcal{F}}(X, Y) = \sup_{f \in \mathcal{F}} \big| \frac{1}{m}\sum_{i=1}^{m} f(x_i) - \frac{1}{n}\sum_{j=1}^{n} f(y_j) \big|$   (2)

where Eq. (2) is derived from empirical estimations of the expectations in Eq. (1). $d_{\mathcal{F}}$ satisfies three axioms: non-negativity, symmetry, and the triangle inequality, and is a pseudo-metric: $P = Q$ implies $d_{\mathcal{F}}(P, Q) = 0$, but $d_{\mathcal{F}}(P, Q) = 0$ does not necessarily imply $P = Q$.

IPMs are a general framework, so various probability metrics can be defined by choosing an appropriate class of witness functions. For example, the Wasserstein-1 metric is defined using 1-Lipschitz functions, $\mathcal{F} = \{f : \|f\|_L \le 1\}$ [3]. By using $\mathcal{F} = \{f : \|f\|_{\mathcal{H}_k} \le 1\}$, we obtain maximum mean discrepancy (MMD), where $\mathcal{H}_k$ represents a reproducing kernel Hilbert space with $k$ as its reproducing kernel [4].
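For reference, a minimal biased estimator of the squared MMD with a Gaussian kernel can be sketched as follows; the bandwidth sigma and the function names are illustrative choices, not prescribed by this framework.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows in a and b."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets x and y (rows are flattened samples)."""
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2 * kxy
```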

Figure 2: A classification model pre-trained on CIFAR-10 demonstrates substantially different feature embeddings for a legitimate sample (first row) when compared to its corresponding adversarial counterpart (third row). The legitimate sample is randomly selected from the CIFAR-10 testing set and the adversarial sample is crafted using the CW attack [6]. Although the model is sensitive to the adversarial sample, it is robust to the sample perturbed with random noise (second row). The random noise has the same scale as the adversarial perturbation. The feature maps visualized in this figure are three randomly selected channels from the last three convolution layers of ResNet-34.

3.3 Detection with Maximum Mean Discrepancy

A specific widely-used kernel in MMD is the Gaussian kernel, $k(x, y) = \exp\big(-\|x - y\|^2 / (2\sigma^2)\big)$. The corresponding implicit feature map has an infinite number of output dimensions, as can be seen via its Taylor expansion. Previous work shows that MMD is able to detect adversarial samples using Fisher’s permutation test [32]. To do this, we initially compute $a = \mathrm{MMD}(X_1, X_2)$, where $X_1$ is the batch to be detected and $X_2$ is a random legitimate batch used as reference. Then, we mutually exchange samples from $X_1$ and $X_2$ to create two new sets, $\hat{X}_1$ and $\hat{X}_2$, and recompute $\hat{a} = \mathrm{MMD}(\hat{X}_1, \hat{X}_2)$. This exchange and recomputation process is repeated 1000 times. If the number of times $a$ is larger than the exchanged MMD $\hat{a}$ exceeds a pre-defined threshold, $X_1$ is determined to be a batch consisting entirely of adversarial samples. For simplicity, the sizes of $X_1$ and $X_2$ are the same and referred to as the batch size. Preliminary results in [14] show that MMD can be used to find adversarial batches when the batch size is sufficiently large (e.g., 50 for an FGSM attack on MNIST). However, this method fails against stronger attacks (e.g., the CW attack) or on more complex datasets (e.g., CIFAR-10) [5].
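A sketch of this permutation procedure, reusing the mmd2 estimator sketched above, might look as follows; the batch tensors are assumed to be flattened to two dimensions, and all names are illustrative rather than a reference implementation of [14, 5].

```python
import torch

def permutation_detect(x1, x2, n_perm=1000, sigma=1.0):
    """Fraction of permutations for which the original MMD exceeds the permuted MMD.
    x1: batch to be tested, x2: legitimate reference batch (both flattened to 2-D)."""
    a = mmd2(x1, x2, sigma)                      # mmd2 as sketched above
    pooled = torch.cat([x1, x2], dim=0)
    n = x1.shape[0]
    hits = 0
    for _ in range(n_perm):
        idx = torch.randperm(pooled.shape[0])
        hits += int(a > mmd2(pooled[idx[:n]], pooled[idx[n:]], sigma))
    return hits / n_perm                         # values near 1 suggest x1 is adversarial
```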

The failure of MMD with a Gaussian kernel is due to two limitations: 1) it has insufficient capacity to handle high-dimensional data with complex distributions [28], and 2) it cannot effectively model semantic information in images [20].

To address the limitations of MMD in detecting abnormal samples, we describe an improved IPM that leverages well-trained neural networks in the following section. The proposed metric tremendously reduces required batch size and obviates the need for the costly Fisher’s permutation process.

Figure 3: Concept illustration of receptive fields in CNN layers. A receptive field is a region in the input (large boxes on left) corresponding to a particular element (small boxes on right) in the feature embeddings. Receptive field size increases linearly with layer depth in the CNN. As a result, computing the channel mean by averaging elements can be thought of as averaging over several sub-images.
1: Input: input batch $X$ (batch size can be 1), pre-trained classification model $f$
2: Initialize an empty CMD vector: $v \leftarrow [\,]$
3: Do the forward pass for $X$ with $f$
4: Record the output feature embeddings of all convolutional layers: $\{F_1, \dots, F_L\}$
5: for $F_l$ in $\{F_1, \dots, F_L\}$ do
6:     for each channel $c$ in $F_l$ do
7:         Compute the channel mean discrepancy: $\mathrm{CMD}_{l,c} = |\mu_{l,c} - \hat{\mu}_{l,c}|$    // where $\mu_{l,c}$ is the channel mean computed from batch $X$ and $\hat{\mu}_{l,c}$ is the corresponding running mean stored in the BN layer
8:         Append $\mathrm{CMD}_{l,c}$ to $v$
9:     end for
10: end for
11: return the CMD vector consisting of CMDs across channels and layers: $v$
Algorithm 1 Computing the channel mean discrepancies across channels and layers

3.4 Detection with Channel Mean Discrepancy

We first choose a class of functions of the form $f(x) = \langle w, \phi_\theta(x) \rangle$ for use with IPMs. Here, $w$ is a scalar vector with bounded norm $\|w\| \le 1$, and $\phi_\theta$ is a neural network parameterized by $\theta$. Given $\phi_\theta$, we see $\{\langle w, \phi_\theta(\cdot)\rangle : \|w\| \le 1\}$ spans a finite-dimensional Hilbert space. The metric indexed by $\theta$ is given by:

$d_\theta(P, Q) = \sup_{\|w\| \le 1} \big| \mathbb{E}_{x \sim P}[\langle w, \phi_\theta(x)\rangle] - \mathbb{E}_{y \sim Q}[\langle w, \phi_\theta(y)\rangle] \big|$   (3)
$= \sup_{\|w\| \le 1} \big| \langle w,\; \mathbb{E}_{x \sim P}[\phi_\theta(x)] - \mathbb{E}_{y \sim Q}[\phi_\theta(y)] \rangle \big|$   (4)
$= \big\| \mathbb{E}_{x \sim P}[\phi_\theta(x)] - \mathbb{E}_{y \sim Q}[\phi_\theta(y)] \big\|$   (5)

In Eq. (3), the maximum is taken over witness functions built on the mapping function $\phi_\theta$, which is a neural network. However, there is no need to optimize $\theta$ to maximize Eq. (3): we simply need a suitable mapping function capable of differentiating samples from the two distributions $P$ and $Q$. We see that a pre-trained classification model fits this criterion. As illustrated in Fig. 2, while adversarial and legitimate samples are visually indistinguishable, the features extracted by the classification model differ significantly. We also show in Fig. 1 that a random network with the same architecture cannot reveal feature-space differences between adversarial and legitimate samples. Based on this analysis of the pre-trained classification model, we define a novel metric

$\mathrm{CMD}_\theta(Q, P) = \big\| \mu_\theta(Q) - \mu_\theta(P) \big\|, \qquad \mu_\theta(\cdot) := \mathbb{E}\big[\phi_\theta(x)\big],$   (6)

where $\mu_\theta(\cdot)$ is the mean of the feature embeddings extracted by $\phi_\theta$. In this paper, we consider the classification model as a convolutional neural network and so only compute the mean of feature embeddings in a channel-wise manner. We call the metric defined in Eq. (6) the Channel Mean Discrepancy (CMD).

Given a legitimate distribution $P$, the second term in Eq. (6), $\mu_\theta(P)$, is ideally the mean of the feature embeddings across the entire training set. However, we find the channel-wise moving average stored in the widely-used Batch Normalization (BN) layer serves as a reasonable approximation of $\mu_\theta(P)$. A BN layer normalizes the feature embeddings to alleviate covariate shift. During the training phase, a BN layer keeps running estimates of the channel-wise feature mean (and variance) computed on the training batches, which are then used for normalization during inference. Given an input batch, the CMDs are computed by: 1) passing the batch to the classification model, 2) computing the batch mean along each channel of the features, and 3) comparing the difference between the batch means and the moving averages from the BN layers.
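A minimal PyTorch sketch of this procedure (and of Algorithm 1) is shown below: forward hooks on each BatchNorm2d layer compare the per-channel batch mean of the layer's input against the stored running mean. It assumes a standard convolutional model with BN layers and is an illustrative sketch, not a reference implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def cmd_vector(model, x):
    """Compute the CMD vector for a batch x with a single forward pass of a pre-trained model."""
    cmds = []

    def hook(module, inputs, output):
        feat = inputs[0]                            # feature embedding entering the BN layer
        batch_mean = feat.mean(dim=(0, 2, 3))       # channel-wise mean over batch and spatial dims
        cmds.append((batch_mean - module.running_mean).abs())

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model.eval()
    model(x)                                        # one standard forward pass
    for h in handles:
        h.remove()
    return torch.cat(cmds)                          # CMDs across all channels and layers
```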

Since adversarial samples are crafted to significantly impact the prediction of classification models, we can expect the CMDs of an adversarial batch to be significantly larger than those of a legitimate batch. In practice, a convolutional neural network usually has several convolutional layers, and thus can extract multi-level features with which to compute their corresponding CMDs. We sum all available CMDs, computed across various channels and layers, to make the difference between adversarial and legitimate batches more pronounced.

Fig. 1 (b) shows the effectiveness of summed CMDs for distinguishing adversarial batches from legitimate batches. The adversarial samples are crafted via the CW attack with a batch size of 8. In contrast, MMD fails to differentiate adversarial and legitimate batches under the same setting, as shown in Fig. 1 (a). For detection with MMD, we use the procedure and hyper-parameters detailed in Sec. 3.3 and [5, 14]. To further validate the necessity of using the pre-trained classification model as the mapping function in Eq. (6), we replace the pre-trained classification model with a randomly weighted counterpart. As shown in Fig. 1 (c), summed CMDs for a randomly weighted model cannot effectively differentiate adversarial and legitimate batches. More details on the experiments in Fig. 1 can be found in Appendix C.

3.5 Effect of Batch Size

Consider a specific channel in the feature embeddings whose mean is computed over all elements from different spatial positions in the channel and different samples in a batch. Each element corresponds to a region in the raw image, as illustrated in Fig. 3, known as its receptive field [2]. Receptive field size increases linearly with layer depth in the CNN. As a result, computing the channel mean by averaging elements across spatial positions can be thought of as averaging over several sub-images. In addition, multiple channels from one feature embedding and multi-level feature embeddings from layers at different depths in the model can be further utilized to increase the versatility of CMDs. This implicitly augments the input batch, enabling CMDs to remain effective at extremely small batch sizes compared to previous MMD-based methods.

Given a batch of input samples, we use the sum of all available CMDs as a metric of how far samples in the batch deviate from the legitimate training distribution.

As a result, the minimum batch size required by CMD is much smaller than with MMD. [14] reports that the minimum batch size to differentiate adversarial and legitimate batches is 50 for an FGSM attack on MNIST. However, this method fails to detect attacks on CIFAR-10 even when using a batch size of 10,000 [5]. In contrast, our summed CMD is able to differentiate adversarial and legitimate batches with batch sizes as small as 4 for a CW attack on CIFAR-10. We highlight two aspects of our approach which contribute to the smaller achievable batch size: 1) CMD uses a well-trained CNN as a much more powerful mapping function that models semantic information by considering pixel correlations, and 2) multi-level and multi-channel CMDs are equivalent to augmenting the input batch with sub-images at different scales, as illustrated in Fig. 3.

Although CMD significantly reduces the required batch size, the ultimate goal is to achieve single image detection (i.e., batch size 1). Using summed CMDs, the detection AUROC drops from 99.31% to 82.11% as the batch size decreases from 4 to 1. In the following section, we further improve summed CMDs with a lightweight detection model to achieve single image detection.

3.6 Sensitivity-Aware CMD

Previous work shows that the various channels in a CNN have different magnitudes and importance [15, 8]. Instead of naively treating all CMDs equally, we learn a coefficient for each CMD before summing. Since detection of abnormal samples is a binary classification task, learning these coefficients translates to fitting a simple logistic regression (LR) detection model. The input to the LR model is a vector consisting of all available CMDs for the classification model and a given sample. The LR model performs a dot product between the input and the learned coefficient vector and outputs the prediction score after applying the sigmoid function, $s = \sigma(w^{\top} v + b)$.

The LR model is lightweight and can be trained efficiently by several optimization methods [21]. In addition, training (and inference of) the LR model requires no raw samples, only vectors of CMDs, which are cheap to obtain via a single forward pass of the classification model on the raw samples.
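For illustration, fitting such a detector with scikit-learn could look like the sketch below, where cmd_vector is the hook-based routine sketched in Sec. 3.4 and the tensors of legitimate and abnormal samples are assumed to be prepared in advance; all names are illustrative.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def fit_cmd_detector(model, legit_samples, abnormal_samples):
    """Fit a logistic regression on per-sample CMD vectors (label 0: legitimate, 1: abnormal).
    Both sample tensors have shape (N, C, H, W)."""
    vectors = [cmd_vector(model, x.unsqueeze(0)).cpu().numpy()
               for x in torch.cat([legit_samples, abnormal_samples])]
    labels = [0] * len(legit_samples) + [1] * len(abnormal_samples)
    return LogisticRegression(max_iter=1000).fit(np.stack(vectors), labels)

# At inference, a single forward pass yields the CMD vector, and the detector scores it:
# score = detector.predict_proba(cmd_vector(model, x).cpu().numpy()[None, :])[:, 1]
```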

In-dist (model) | Train on | OOD | Baseline [17] | ODIN [27] | Mahalanobis [25] | Ours
Each entry reports TNR at TPR 95% / AUROC / Detection acc.
CIFAR-10
(ResNet-34)
SVHN SVHN 32.5 / 89.9 / 85.1 86.6 / 96.7 / 91.1 96.4 / 99.1 / 95.8 100 / 99.9 / 99.5
LSUN 45.4 / 91.0 / 85.3 13.8 / 81.1 / 55.5 31.2 / 86.8 / 79.6 85.8 / 97.1 / 91.2
TinyImageNet 44.7 / 91.0 / 85.1 16.3 / 82.9 / 77.9 40.3 / 88.4 / 81.0 83.3 / 96.7 / 90.1
CIFAR-10
(ResNet-34)
LSUN LSUN 45.4 / 91.0 / 85.3 73.8 / 94.1 / 86.7 98.9 /99.7 / 97.7 100 / 99.9 / 99.8
SVHN 32.5 / 89.9 / 85.1 35.5 / 83.6 / 74.6 80.4 / 96.8 / 91.8 92.2 / 98.0 / 93.7
TinyImageNet 44.7 / 91.0 / 85.1 72.5 / 94.0/ 86.5 96.9 / 99.5 / 96.2 99.9 / 99.9 / 98.9
CIFAR-100
(ResNet-34)
SVHN SVHN 20.3 / 79.5 / 73.2 62.7 / 93.9 / 88.0 91.9 / 98.4 / 93.7 100 / 99.9 / 99.4
LSUN 18.8 / 75.8 / 69.9 2.7 / 51.7 / 52.1 1.8 / 44.0 / 50.0 53.6 / 90.6 / 83.4
TinyImageNet 20.4 / 77.2 / 70.8 4.1 / 57.0 / 55.9 7.1 / 58.9 / 58.7 55.5 / 91.2 / 83.5
CIFAR-100
(ResNet-34)
LSUN LSUN 18.8 / 75.8 / 69.9 45.6/ 85.6 / 78.3 90.9 / 98.2 / 93.5 100 / 99.9 / 99.8
SVHN 20.3 / 79.5 / 73.2 16.5 / 70.7 / 65.6 55.1 / 92.4 / 85.4 79.8 / 94.2 / 88.8
TinyImageNet 20.4 / 77.2 / 70.8 48.5 / 87.8 / 80.3 89.9 / 98.0 / 92.9 99.9 / 99.9 / 98.7
CIFAR-10
(DenseNet-100)
SVHN SVHN 40.2 / 89.9 / 83.2 86.2 / 95.5 / 91.4 90.8 / 98.1 / 93.9 99.9 / 99.9 / 99.3
LSUN 66.4 / 95.4 / 90.3 78.1 / 94.6 / 91.2 42.9 / 84.9 / 80.1 91.6 / 98.2 / 93.4
TinyImageNet 58.8 / 94.1 / 88.5 68.2 / 91.2 / 88.4 43.3 / 77.7 / 75.9 82.5 / 96.8 / 90.9
CIFAR-10
(DenseNet-100)
LSUN LSUN 66.6 / 95.4 / 90.3 96.2 / 99.2 / 95.7 97.2 / 99.3 / 96.3 99.9 / 99.9 / 99.6
SVHN 40.0 / 89.9 / 83.2 37.5 / 87.3 / 79.6 12.2 / 45.9 / 54.6 86.1 / 96.4 / 91.1
TinyImageNet 58.8 / 94.1 / 88.5 91.8 / 98.4 / 93.8 92.2 / 97.1 / 93.6 99.9 / 99.9 / 98.6
CIFAR-100
(DenseNet-100)
SVHN SVHN 26.7 / 82.7 / 75.6 70.6 / 93.8 / 86.6 82.5 / 97.2 / 91.5 99.9 / 99.9 / 99.1
LSUN 16.7 / 70.8 / 64.9 6.0 / 67.2 / 64.0 47.1 / 82.1 / 77.4 70.4 / 94.7 / 87.8
TinyImageNet 17.5 / 71.6 / 65.7 7.5 / 69.1 / 65.0 52.1 / 80.6 / 77.1 57.2 / 92.6 / 85.4
CIFAR-100
(DenseNet-100)
LSUN LSUN 16.7 / 70.8 / 64.9 41.2 / 85.5 / 77.1 91.4 / 98.0 / 93.9 100 / 99.9 / 99.7
SVHN 26.7 / 82.7 / 75.6 26.1 / 85.2 / 78.7 64.0 / 90.3 / 84.2 78.2 / 93.3 / 88.2
TinyImageNet 17.5 / 71.6 / 65.7 41.1 / 84.8 / 76.6 83.0 / 94.1 / 89.6 99.8 / 99.6 / 98.4
Table 2: Comparison with other methods for detecting OOD samples. All values are percentages, and larger values are better. The best results are highlighted in bold.

4 Experiments

Model | Dataset | Method | No Transfer: FGSM / BIM / DeepFool / CW | Transfer: FGSM / BIM / DeepFool / CW
ResNet
CIFAR-10
LID [29] 99.69 96.28 88.51 82.23 99.69 95.38 71.86 77.53
Mahalanobis [25] 99.94 99.57 91.57 95.84 99.94 98.91 78.06 93.90
Ours 100.0 100.0 97.48 99.94 99.99 99.98 83.94 99.94
CIFAR-100
LID [29] 98.73 96.89 71.95 78.67 98.73 55.82 63.15 75.03
Mahalanobis [25] 99.77 96.90 85.26 91.77 99.77 96.38 81.95 90.96
Ours 100.0 100.0 89.63 99.84 100.0 99.66 82.13 99.84
SVHN
LID [29] 97.86 90.74 92.40 88.24 97.86 84.88 67.28 76.58
Mahalanobis [25] 99.62 97.15 95.73 92.15 99.62 95.39 72.20 86.73
Ours 100.0 99.98 98.70 99.68 99.85 99.93 80.66 99.68
DenseNet
CIFAR-10
LID [29] 98.20 99.74 85.14 80.05 98.20 94.55 70.86 71.50
Mahalanobis [25] 99.94 99.78 83.41 87.31 99.94 99.51 83.42 87.95
Ours 100.0 100.0 96.76 99.91 100.0 99.51 93.31 99.91
CIFAR-100
LID [29] 99.35 98.17 70.17 73.37 99.35 68.62 69.68 72.36
Mahalanobis [25] 99.86 99.17 77.57 87.05 99.86 98.27 75.63 86.20
Ours 100.0 100.0 95.35 99.84 100.0 99.97 88.06 99.84
SVHN
LID [29] 99.35 94.87 91.79 94.70 99.35 92.21 80.14 85.09
Mahalanobis [25] 99.85 99.28 95.10 97.03 99.85 99.12 93.47 96.95
Ours 100.0 100.0 98.77 99.98 100.0 99.91 92.33 99.98
Table 3: Comparison with other methods for detecting adversarial samples. Values reported in the table are AUROC on a testing set consisting of legitimate, adversarial, and noisy samples. All values are percentages, and larger values are better. The best results are highlighted in bold. ‘No Transfer’ means that training and inference for the logistic regression are conducted with the same attack method. The ‘Transfer’ results reported are for Mahalanobis trained on FGSM and our method trained on CW.

We evaluate our technique using multiple classification models, including ResNet-34 [16] and DenseNet-100 [19], trained on the CIFAR-10, CIFAR-100, and SVHN datasets. To validate the generality of our method and for fairness, we use pre-trained weights downloaded directly from previous works (github.com/pokaxpoka/deep_Mahalanobis_detector/) or open-source implementations (github.com/osmr/imgclsmob) for all classification models. Consistent with the configurations from [25, 29], we adopt the following evaluation metrics: the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), and detection accuracy. We compare against state-of-the-art methods which do not require modifying model architectures or fine-tuning classification models, because they require a similar magnitude of training effort to our own.

4.1 Detecting Out-of-distribution Samples

We evaluate the discriminatory power of CMD on four OOD datasets: CIFAR-10, SVHN, TinyImageNet, and LSUN [27]. The TinyImageNet dataset is a subset of ImageNet, and the LSUN dataset contains scene images from 10 different categories. To reconcile image size differences with CIFAR-10 and SVHN, images in TinyImageNet and LSUN are resized to 32 × 32 pixels. We refer readers to [27] for more details. In each experiment, the training dataset used for the classification model is considered the in-distribution dataset.

For our method, we set the batch size to 1 and use sensitivity-aware CMD, as detailed in Sec. 3.6, to detect OOD samples. To train the logistic regression model for sensitivity-aware CMD, we randomly select 2000 samples from the in-distribution dataset and another 2000 samples from the OOD dataset, following the same settings as in [25]. We generate corresponding CMD vectors for each of the 4000 images for logistic regression training. Detection performance of the logistic regression is tested on a mixture of the in-distribution testing set and the OOD testing set. In real-world applications, a classification model does not know the type of OOD samples in advance. Thus, we are interested in how a logistic regression model, trained on one type of OOD dataset, performs on another OOD dataset. As shown in Table 2, sensitivity-aware CMD consistently outperforms the compared methods on various datasets and classification models by a large margin, and exhibits higher transferability across OOD datasets.

In addition, our method does not require any hyper-parameters, in contrast to both ODIN [27] and Mahalanobis [25]. For all results in Table 2, hyper-parameters in ODIN and Mahalanobis are carefully tuned on validation sets via a grid search following their official implementations.

4.2 Detecting Adversarial Samples

We evaluate our approach using four attack methods: FGSM [13], BIM [22], DeepFool [30], and CW [6]. Consistent with previous work [27, 25], the datasets used in training and inference consist of three kinds of data: 1) legitimate samples from the original training and testing sets, 2) adversarial samples crafted from legitimate samples using the specific attack methods, and 3) noisy samples crafted by adding random noise to legitimate samples. When crafting adversarial and noisy samples, we constrain the scale of adversarial and noisy perturbations as described in [25]. Note that all adversarial samples yield different prediction results when compared to their corresponding legitimate counterparts, but the noisy samples yield results identical (and correct) to those of their counterparts. A robust detection method should be able to classify noisy samples as normal.

Similar to [25], we randomly select 2000 legitimate samples from the original training dataset and their corresponding adversarial and noisy samples to train the logistic regression model. A mixture of legitimate, adversarial, and noisy testing samples is used to evaluate the methods.

We also evaluate transferability of the logistic regression model. Specifically, a logistic regression trained on one type of attack is also evaluated on the other attack types. We find that the logistic regression model trained on the CW attack has the best transferability. Mahalanobis transfers best to other attacks when trained using the FGSM attack. Thus, the ‘transfer’ results reported in Table 3 are for Mahalanobis trained on FGSM, and our method trained on CW.

We compare our method with LID [29] and Mahalanobis [25]. Similar to above, hyper-parameters for LID and Mahalanobis are carefully tuned via grid search, and their results are reported using optimal hyper-parameters. As shown in Table 3, our method consistently outperforms LID and Mahalanobis in both ‘no transfer’ and ‘transfer’ settings.

4.3 Inference Cost

Our proposed method is lightweight and efficient: for an incoming sample, only a single forward pass on the pre-trained classification model is required to compute its CMD vector. The extra overhead introduced by our detection method is negligible during inference. We compare our method with other approaches in terms of inference latency. Specifically, we test each method with eight samples (i.e., batch size 8) and repeat the inference procedure 100 times. Testing hardware included an Intel Xeon E5-2650 v4 CPU, an NVIDIA 1080 Ti GPU, and 64 GB RAM. The times for 100 runs are reported in Table 1. We find our method outperforms all other methods in inference cost, because our approach requires only a single forward pass to compute CMD vectors. In contrast, the other approaches (i.e., ODIN, Mahalanobis, and LID) typically require either: 1) at least one additional forward and backward pass for adding noise to inputs, or 2) searching several nearest neighbours by measuring the activation values of multiple inner layers.
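A generic way to time such a detector for 100 runs on a fixed batch is sketched below; this is an illustrative measurement loop, not the exact benchmarking script used for Table 1.

```python
import time
import torch

def time_detector(detect_fn, x, runs=100):
    """Total wall-clock time (seconds) for `runs` detection calls on a fixed batch x."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # make sure pending GPU work is not counted
    start = time.time()
    for _ in range(runs):
        detect_fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # wait for the last kernels to finish
    return time.time() - start
```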

Train on | OOD | ResNet-34 | VGG-19
Each entry reports TNR at TPR 95% / AUROC / Detection acc.
LSUN LSUN 100 / 99.9 / 99.8 100 / 99.8 / 99.3
SVHN 92.2 / 98.0 / 93.7 99.9 / 99.9 / 99.3
TinyImageNet 99.9 / 99.9 / 98.9 100 / 99.8 / 99.3
Table 4: Detecting OOD samples with VGG-19.

4.4 Applying CMD on Models without BN Layers

CMD retrieves the running average statistics stored in the BN layers to approximate the statistics of the feature embeddings ($\mu_\theta(P)$ in Eq. (6)) across the entire legitimate training set. If a classification model does not have BN layers, we can manually estimate $\mu_\theta(P)$ by traversing the entire training set with the classification model and computing the running average statistics for each channel.
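An illustrative sketch of this offline estimation with forward hooks on each convolutional layer follows; the exponential moving average with momentum 0.1 mirrors PyTorch's BN default and is an assumption, as are all names here.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def estimate_channel_means(model, train_loader, momentum=0.1):
    """Estimate per-channel running means of every Conv2d output over the training set."""
    running = {}

    def make_hook(name):
        def hook(module, inputs, output):
            mean = output.mean(dim=(0, 2, 3))       # channel-wise mean for this training batch
            if name in running:
                running[name] = (1 - momentum) * running[name] + momentum * mean
            else:
                running[name] = mean
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    model.eval()
    for x, _ in train_loader:
        model(x)
    for h in handles:
        h.remove()
    return running                                  # dict: layer name -> per-channel running mean
```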

We use VGG-19 trained on CIFAR-10 as an example. VGG-19 has 16 convolutional layers and 3 fully-connected layers, and no BN layers. We use the aforementioned method to compute the running average statistics before repeating the OOD experiments. As shown in Table 4, detection performance does not degrade at all.

5 Conclusion

Abnormal samples, both adversarial and out-of-distribution, belong to distributions distinct from the distribution of legitimate samples. Inspired by integral probability metrics, we proposed a novel metric, Channel Mean Discrepancy (CMD), for evaluating how far input samples’ distributions deviate from a legitimate distribution. We showed how CMD could be used to reliably and efficiently identify both out-of-distribution and adversarial samples in a unified way. From a theoretical perspective, we have provided an intuition as to why CMD can be computed by channel-wise comparison of feature means and running means (retrieved from batch normalization layers). Our empirical results validate the effectiveness and generalizability of CMD, and a simple logistic regression model with CMDs can achieve state-of-the-art detection performance. In addition, CMDs can be computed very efficiently. Detecting with CMD is thus as fast as a standard forward pass through the classification model. Further studies in this direction may lead to improved techniques for both adversarial generation and detection.

References

  • [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In ICML, Cited by: §1.
  • [2] A. Araujo, W. Norris, and J. Sim (2019) Computing receptive fields of convolutional neural networks. Distill. Note: https://distill.pub/2019/computing-receptive-fields Cited by: §3.5.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML, Cited by: §3.2.
  • [4] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. ICLR. Cited by: §1, §3.2.
  • [5] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Cited by: Appendix C, §1, Figure 1, §3.3, §3.4, §3.5.
  • [6] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: Appendix A, 4th item, Appendix F, §1, Figure 2, §3.1, §4.2.
  • [7] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §2.
  • [8] X. Dong, S. Chen, and S. J. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. NeurIPS. Cited by: §3.6.
  • [9] N. Fournier and A. Guillin (2015) On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields. Cited by: §1.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §2.
  • [11] Z. Gong, W. Wang, and W. Ku (2017) Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960. Cited by: §2.
  • [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §2.
  • [13] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: 1st item, §1, §3.1, §4.2.
  • [14] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv:1702.06280. Cited by: Appendix C, §1, §2, Figure 1, §3.1, §3.3, §3.4, §3.5.
  • [15] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. NeurIPS. Cited by: §3.6.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §4.
  • [17] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §2, §3.1, Table 2.
  • [18] Y. Hsu, Y. Shen, H. Jin, and Z. Kira (2020) Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960. Cited by: §2.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.
  • [20] M. Kirchler, S. Khorasani, M. Kloft, and C. Lippert (2020) Two-sample testing using deep learning. In AISTATS, Cited by: §3.3.
  • [21] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein (2002) Logistic regression. Springer. Cited by: Appendix D, §3.6.
  • [22] A. Kurakin, I. Goodfellow, S. Bengio, et al. (2016) Adversarial examples in the physical world. Cited by: Appendix A, 2nd item, §4.2.
  • [23] Y. LeCun (1998) The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. Cited by: §1.
  • [24] K. Lee, H. Lee, K. Lee, and J. Shin (2017) Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325. Cited by: §2.
  • [25] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. NeurIPS. Cited by: Appendix A, Appendix C, 3rd item, Appendix F, §2, Table 1, §3.1, Table 2, §4.1, §4.1, §4.2, §4.2, §4.2, Table 3, §4.
  • [26] X. Li and F. Li (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5764–5772. Cited by: §2.
  • [27] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR. Cited by: §1, §2, Table 1, §3.1, Table 2, §4.1, §4.1, §4.2.
  • [28] F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland (2020) Learning deep kernels for non-parametric two-sample tests. In ICML, Cited by: §3.3.
  • [29] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. ICLR. Cited by: 3rd item, Appendix F, §2, Table 1, §4.2, Table 3, §4.
  • [30] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: 3rd item, §3.1, §4.2.
  • [31] A. Müller (1997) Integral probability metrics and their generating classes of functions. Advances in Applied Probability. Cited by: §1, §3.2.
  • [32] A. Odén, H. Wedel, et al. (1975) Arguments for fisher’s permutation test. Annals of Statistics. Cited by: §3.3.
  • [33] T. Pang, C. Du, Y. Dong, and J. Zhu (2018) Towards robust detection of adversarial examples. In NeurIPS, Cited by: §1, §2, Table 1.
  • [34] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal (2020) Uncertainty estimation using a single deep deterministic neural network. In International Conference on Machine Learning, pp. 9690–9700. Cited by: §1, Table 1.

Appendix A Robustness Against Adaptive Attacks

To further investigate the robustness of our method, we test our approach under the scenario where both our model and our defense schema are fully exposed to the attacker. The attacker can therefore perform an adaptive attack (i.e., an attack specifically designed against a given defense) to bypass our detector while deceiving the classification model, as described in Algo. 2.

1: Input: legitimate input batch $x$ with ground-truth labels $y$, pre-trained classification model $f$, step size $\alpha$, coefficient $\lambda$ for the summed-CMD loss term, maximum optimization iterations $T$, and perturbation constraint $\epsilon$
2: Initialize $x_{\mathrm{adv}}$ and $t$: $x_{\mathrm{adv}} \leftarrow x$, $t \leftarrow 0$
3: while $t < T$ do
4:     Compute the loss: $\mathcal{L} = \ell_{\mathrm{CE}}\big(f(x_{\mathrm{adv}}), y\big) - \lambda \cdot \mathbf{1}^{\top} v(x_{\mathrm{adv}})$  (Eq. (7))
5:     Update $x_{\mathrm{adv}}$: $x_{\mathrm{adv}} \leftarrow x_{\mathrm{adv}} + \alpha \cdot \mathrm{sign}\big(\nabla_{x_{\mathrm{adv}}} \mathcal{L}\big)$
6:     Constrain the norm of the perturbation: $x_{\mathrm{adv}} \leftarrow x + \Pi_{\epsilon}\big(x_{\mathrm{adv}} - x\big)$, where $\Pi_{\epsilon}$ projects onto the $\epsilon$-ball
7:     $t \leftarrow t + 1$
8: end while
9: Return: optimized adversarial batch $x_{\mathrm{adv}}$
Algorithm 2 Adaptive attack

As seen in Algo. 2, we propose a novel objective function

$\mathcal{L}(x_{\mathrm{adv}}) = \ell_{\mathrm{CE}}\big(f(x_{\mathrm{adv}}), y\big) - \lambda \cdot \mathbf{1}^{\top} v(x_{\mathrm{adv}})$   (7)

to perform the adaptive attack, where $\ell_{\mathrm{CE}}$ represents the cross-entropy loss and $v(x_{\mathrm{adv}})$ is the CMD vector for the input batch $x_{\mathrm{adv}}$, as described in Algo. 1. The second term is a novel regularization. The detection schema in this paper is based on the fact that a legitimate input batch has smaller channel mean discrepancies than its adversarial (or OOD) counterpart. To bypass our detector, a straightforward method for attackers is to minimize the summed CMD when crafting the adversarial samples. By updating the input batch with a signed gradient (line 5 in Algo. 2) computed from $\mathcal{L}$, the optimized adversarial batch is expected to both deceive the classification model (implied by the first term in $\mathcal{L}$) and possess a small summed CMD value (implied by the second term) simultaneously.
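An illustrative PyTorch sketch of this adaptive attack is given below; the $\ell_\infty$-style clipping of the perturbation, as in standard BIM, is our assumption here, and the default step size, budget, and iteration count follow the values discussed in this appendix. The summed CMD is recomputed differentiably so that its gradient reaches the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def summed_cmd(model, x):
    """Differentiable summed CMD: like Algorithm 1, but without no_grad so gradients reach x."""
    terms = []
    def hook(module, inputs, output):
        terms.append((inputs[0].mean(dim=(0, 2, 3)) - module.running_mean).abs().sum())
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(x)
    for h in handles:
        h.remove()
    return torch.stack(terms).sum()

def adaptive_attack(model, x, y, eps=0.25, alpha=0.01, lam=1.0, max_iter=1000):
    """BIM-style loop: raise the classification loss while keeping the summed CMD small (Eq. 7)."""
    model.eval()                                             # keep BN in inference mode
    x_adv = x.clone().detach()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y) - lam * summed_cmd(model, x_adv)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()              # ascend the combined objective
            x_adv = x + (x_adv - x).clamp(-eps, eps)         # clip the perturbation (L-inf here)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```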

Figure 4: Overall success rate under various coefficients $\lambda$. The overall success rate is the ratio of the number of adversarial samples which both deceive the classification model and bypass our detector to the number of adversarial samples crafted via Algo. 2 (10,000).
Figure 5: Detection AUROC and attack success rate under various coefficients $\lambda$. The attack success rate is the ratio of the number of adversarial samples which deceive the classification model to the number of adversarial samples crafted via Algo. 2 (10,000).

We set the maximum number of optimization iterations to 1,000 to ensure the efficacy of the adaptive attack. The perturbation budget $\epsilon$ and step size $\alpha$ are set to 0.25 and 0.01, respectively, following previous work [25]. We vary the coefficient $\lambda$ of the CMD regularization to balance the two loss terms in Eq. (7). When $\lambda = 0$, the proposed adaptive attack regresses to a standard basic iterative method (BIM) [22], but with many more optimization iterations.

Notably, there may be other ways to construct an adaptive attack. For example, an adaptive attack could possibly be built upon the CW attack [6]. However, the CW attack introduces an additional coefficient during its optimization procedure, and it is notoriously difficult to search for two proper coefficients with which to conduct an adaptive attack. Moreover, standard BIM achieves an attack success rate comparable to CW when $\epsilon$ is set to 0.25. Therefore, here, we only consider building an adaptive attack based on the BIM approach. We leave stronger adaptive attacks inspired by our proposed detection schema to future work.

We test the robustness of our approach against the adaptive attack by increasing the coefficient $\lambda$, using ResNet-34 on the CIFAR-10 test set (containing 10,000 legitimate samples). We gradually increase the coefficient from 0, calculating the percentage of adversarial samples which can successfully deceive the classification model and bypass the detector out of the 10,000 adversarial samples crafted using Algo. 2. The result is shown in Fig. 4. For standard BIM ($\lambda = 0$), while almost all adversarial samples can deceive the classification model, most adversarial samples are caught by our detector. Thus, the overall success rate is about 0%. Increasing $\lambda$ allows more adversarial samples to bypass the detector, but lowers the attack success rate. Consequently, when $\lambda$ is very large, the overall success rate is quite small. This trend can be better observed in Fig. 5, where the success rate of deceiving the classification model drops quickly as $\lambda$ increases.

Note that, for the above experiment, we directly use the detector trained on standard BIM adversarial samples from Sec. 4.2. In practice, a defender can further fine-tune the detector to better defend against adaptive attacks with little training effort.

Preliminary results show our approach has significant potential for providing robustness to DNN classification models, even against adaptive attacks. Interestingly, the most suitable $\lambda$ leads to an overall success rate of only 16.76%; a smaller or a larger $\lambda$ leads to a lower overall success rate. Our empirical results indicate that there is a trade-off between deceiving the classification model and bypassing our proposed detection: it is difficult for attackers to achieve both simultaneously.

Notably, both determining a proper $\lambda$ value and using a large number of optimization iterations make such an adaptive attack very costly for an attacker, further disincentivizing attacks.

Appendix B Derivation of Eq. (5)

We start from the general definition of IPMs,

$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \big|.$

We consider a specific class of functions $\mathcal{F} = \{\langle w, \phi_\theta(\cdot)\rangle : \|w\| \le 1\}$, so we substitute $f$ in the above equation with $\langle w, \phi_\theta(\cdot)\rangle$,

$d_\theta(P, Q) = \sup_{\|w\| \le 1} \big| \langle w,\; \mathbb{E}_{x \sim P}[\phi_\theta(x)] - \mathbb{E}_{y \sim Q}[\phi_\theta(y)] \rangle \big| = \big\| \mathbb{E}_{x \sim P}[\phi_\theta(x)] - \mathbb{E}_{y \sim Q}[\phi_\theta(y)] \big\|,$

where the last equality is obtained using the Cauchy–Schwarz inequality (given $\|w\| \le 1$ and $\big\| \mathbb{E}_{x \sim P}[\phi_\theta(x)] - \mathbb{E}_{y \sim Q}[\phi_\theta(y)] \big\| \le C$, where $C$ is a finite constant).

The boundedness of $\phi_\theta$ is easy to guarantee. In our case, $\phi_\theta$ is a neural network, and its input (i.e., images) has a bounded norm. The boundedness of $\phi_\theta$ is thus ensured as long as the weights and biases of $\phi_\theta$ have finite norms.

CIFAR-10 (perturbation / Acc.) | CIFAR-100 (perturbation / Acc.) | SVHN (perturbation / Acc.)
DenseNet Clean 0 95.23% 0 77.61% 0 96.38%
FGSM 0.21 20.04% 0.21 4.86% 0.21 56.27%
BIM 0.22 0.00% 0.22 0.02% 0.22 0.67%
DeepFool 0.30 0.23% 0.25 0.23% 0.57 0.50%
CW 0.05 0.10% 0.03 0.16% 0.12 0.54%
ResNet Clean 0 93.67% 0 78.34% 0 96.68%
FGSM 0.25 23.98% 0.25 11.67% 0.25 49.33%
BIM 0.26 0.02% 0.26 0.21% 0.26 2.37%
DeepFool 0.36 0.33% 0.27 0.37% 0.62 13.20%
CW 0.08 0.00% 0.08 0.01% 0.15 0.04%
Table 5: The perturbation and classification accuracy on clean and adversarial samples.

Appendix C Details of Experiments in Fig. 1

We first use the CW attack to craft adversarial images based on the testing set of CIFAR-10 following the configurations found in [25]. We implement both MMD and CMD using a pre-trained model, and CMD using a randomly initialized model to differentiate between legitimate and adversarial batches. To fairly compare the three methods, we use the same exact batches.

CMD using a pre-trained model

Given a legitimate batch and its corresponding adversarial batch, we use Algo. 1 to compute their CMD vectors. We sum over all elements in each CMD vector to get the summed CMD (a scalar). The y-axis is the normalized summed CMD: summed CMDs are normalized by dividing by the largest summed CMD across all batches.

CMD using a randomly initialized model

We repeat the above experiment but with a randomly initialized model. The model has the same architecture as the pre-trained model, but its weights and biases are randomly initialized using a Gaussian distribution. To obtain running average statistics in the BN layers of the randomly initialized model, we run the model on the original training set for 20 epochs. This updates the BN layers’ running average statistics but does not update the weights and biases.
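Updating only the BN running statistics in this way can be sketched as follows: forward passes in train mode refresh the running means and variances, while no optimizer step is taken, so weights and biases stay at their random initialization. The function name and loader are illustrative.

```python
import torch

@torch.no_grad()
def refresh_bn_stats(model, train_loader, epochs=20):
    """Run forward passes in train mode so BN layers update their running statistics,
    while no optimizer is used, leaving weights and biases untouched."""
    model.train()
    for _ in range(epochs):
        for x, _ in train_loader:
            model(x)
    model.eval()
    return model
```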

MMD

MMD is first computed on a pair of batches. We introduce a third batch, consisting of legitimate samples, as a reference batch. Given an incoming batch (legitimate or adversarial), its MMD is computed by comparing it to a random reference batch. To make MMD more robust, we do not directly compute the MMD between the incoming and reference batches. Instead, we adopt Fisher’s permutation test as used by [14, 5]. Given an incoming batch $X_1$ and a reference batch $X_2$, we initially let $a = \mathrm{MMD}(X_1, X_2)$. Then, we randomly exchange samples between $X_1$ and $X_2$ to obtain two new batches, $\hat{X}_1$ and $\hat{X}_2$, and let $\hat{a} = \mathrm{MMD}(\hat{X}_1, \hat{X}_2)$. If $a > \hat{a}$, then $X_1$ and $X_2$ are likely from different distributions. That is, the incoming batch is likely adversarial. The exchange and recomputation process is repeated 1000 times, and the p-value is the fraction of times $a > \hat{a}$. The y-axis in Fig. 1 represents the p-value. For incoming legitimate and adversarial batches, we expect the p-value to be close to 0 and 1, respectively. However, as shown in Fig. 1, MMD does not differentiate between legitimate and adversarial batches.

Appendix D Training Logistic Regression Model

Logistic regression is used to tune channel sensitivities to further improve the discriminatory ability of CMDs, as detailed in Sec. 3.6. Logistic regression can be trained efficiently using several optimization methods [21]. We use multiple optimization methods, including L-BFGS and SAG as implemented in scikit-learn, and SGD as implemented in PyTorch. Similar results were obtained with all of the optimization methods.

Appendix E Evaluation Metrics

To evaluate our experiments, we adopt three metrics:

  • True negative rate at true positive rate (TNR at TPR 95%): This metric is computed as $\mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$ when $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN}) = 0.95$. TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.

  • Area under the receiver operating characteristic curve (AUROC): The ROC curve is a graph of the true positive rate, $\mathrm{TPR} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, across different false positive rates. The false positive rate is computed as $\mathrm{FPR} = \mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$, and both can be adjusted by varying the threshold setting.

  • Detection accuracy: Consistent with previous works [25, 29], we use the maximum classification accuracy from various threshold settings as the detection accuracy.
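These metrics can be computed from detection scores and labels with scikit-learn; the sketch below is illustrative and assumes roughly balanced positive and negative test sets when computing the detection accuracy.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def detection_metrics(scores, labels):
    """labels: 1 for abnormal (positive), 0 for legitimate; scores: higher means more abnormal."""
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, thresholds = roc_curve(labels, scores)
    tnr_at_tpr95 = 1.0 - fpr[np.argmax(tpr >= 0.95)]    # TNR at the first threshold with TPR >= 95%
    det_acc = np.max(0.5 * (tpr + (1.0 - fpr)))          # best balanced accuracy over all thresholds
    return tnr_at_tpr95, auroc, det_acc
```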

Appendix F Implementation of adversarial attacks

We use four attack techniques from previous works:

  • Fast Gradient Sign Method (FGSM) [13]

  • Basic iterative method (BIM) [22]

  • DeepFool [30]

  • Carlini-Wagner (CW) [6]

We implement each attack technique in a non-targeted setting [6, 29, 25] and constrain the perturbation distance following [25]. Details about the perturbation and classification accuracy across the various attack methods, models, and datasets are included in Table 5.