Normal examples are sampled from the natural data generating process. Images of handwritten digits are one example.
Outliers are examples in the training set of normal examples that are difficult for humans to classify.
Adversarial examples are examples crafted by attackers to be perceptually close to normal examples while causing misclassification.
Unsanitized models are trained with all the examples in the training set. We assume that the training set contains only normal examples.
Sanitized models are trained with the remaining examples after we remove outliers from the training set.
Sanitization is the process of removing outliers from the training set. We propose two automatic sanitization methods.
This approach applies to AI tasks that have canonical examples. For example, for handwritten digits, computer fonts may be considered canonical examples (some computer fonts are difficult to recognize and are therefore excluded from our evaluation). Based on this observation, we use canonical examples to discard outliers in our training set by the following steps:
Augment the set of canonical examples by applying common transformations, e.g., rotating and scaling computer fonts.
Train a model using the augmented canonical examples.
Use the trained model to detect and discard outliers in the training set. An example is an outlier if the model has low confidence on the class of that example.
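The discarding step above can be sketched as a simple confidence filter. This is a minimal illustration, not the implementation used in our experiments: the function name `sanitize`, the `predict_proba` callable, and the toy model are hypothetical stand-ins for a trained model that returns class probabilities.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, class label)


def sanitize(
    training_set: List[Example],
    predict_proba: Callable[[Sequence[float]], Sequence[float]],
    threshold: float,
) -> Tuple[List[Example], List[Example]]:
    """Split a training set into kept examples and discarded outliers.

    An example is treated as an outlier when the model's confidence on
    the example's own class falls below `threshold`.
    """
    kept: List[Example] = []
    outliers: List[Example] = []
    for features, label in training_set:
        confidence = predict_proba(features)[label]
        if confidence < threshold:
            outliers.append((features, label))
        else:
            kept.append((features, label))
    return kept, outliers


def toy_model(features: Sequence[float]) -> List[float]:
    """Hypothetical two-class model: the first feature is P(class 1)."""
    p1 = min(max(features[0], 0.0), 1.0)
    return [1.0 - p1, p1]


data = [([0.95], 1), ([0.40], 1), ([0.10], 0), ([0.55], 0)]
kept, outliers = sanitize(data, toy_model, threshold=0.7)
```

The same filter covers both sanitization methods; only the model behind `predict_proba` changes (one trained on canonical examples, or one trained on the training set itself).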
Not all AI tasks have canonical examples. For such tasks, we use all the examples to train a model, and then discard examples that have high training errors.
After removing outliers from the original training set, we obtain a sanitized set, which we use to train a model called the sanitized model. We then evaluate whether the sanitized model is more robust than unsanitized models using two metrics: classification accuracy and distortion of adversarial examples.
1.3 Detecting adversarial examples
We use the Kullback-Leibler divergence [9] between the outputs of the original and the sanitized models to measure the difference in their sensitivity to adversarial examples. The Kullback-Leibler divergence from a distribution $P$ to $Q$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
By setting a proper threshold, we are able to detect nearly all adversarial examples with an acceptable false reject rate. The method requires no modifications to the original model structure and no data outside the original dataset.
The detection method has difficulty distinguishing adversarial examples from normal examples when the distortion of the adversarial image is very small. To address this problem, we designed a complete adversarial example detection framework, which we discuss in detail in Section 2.4.
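The divergence-based detection described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the model outputs are taken to be probability vectors (e.g., softmax outputs), the first argument is the unsanitized model's output, and the small constant `eps` is our addition to avoid division by zero.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Return sum_i p[i] * log(p[i] / q[i]) in nats.

    Terms with p[i] == 0 contribute zero; q[i] is clamped to `eps`
    (an assumption of this sketch) so the ratio stays finite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def is_adversarial(unsanitized_out: Sequence[float],
                   sanitized_out: Sequence[float],
                   threshold: float) -> bool:
    """Flag an example whose divergence exceeds the threshold."""
    return kl_divergence(unsanitized_out, sanitized_out) > threshold


# Both models agree: the divergence is zero and the example is accepted.
agree = is_adversarial([0.9, 0.1], [0.9, 0.1], threshold=0.5)
# The models disagree strongly: the divergence is large and it is flagged.
disagree = is_adversarial([0.1, 0.9], [0.9, 0.1], threshold=0.5)
```

The threshold of 0.5 here is arbitrary and only serves the toy inputs.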
2.1 Set up
We used two data sets, MNIST (http://yann.lecun.com/exdb/mnist/) and SVHN (http://ufldl.stanford.edu/housenumbers/), to evaluate our proposed method. We performed both canonical sanitization and self sanitization on MNIST and only self sanitization on SVHN. We pre-processed SVHN with the following steps to obtain individual clean digit images. After pre-processing, we obtained images from the original SVHN training set and test images from the original SVHN test set.
Cropping individual digits using the bounding boxes.
Discarding images in which either dimension is less than 10.
Resizing the larger dimension of each image to 28 while keeping the aspect ratio, and then padding the image to 28 × 28. When padding an image, we used the average color of the border as the padding color.
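The pre-processing steps above can be sketched in plain Python. This is a simplified illustration rather than the pipeline we ran: it assumes grayscale images stored as nested lists, nearest-neighbour resizing, and centered padding, none of which are specified above, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]  # grayscale image: rows of pixel intensities


def crop(img: Image, left: int, top: int, width: int, height: int) -> Image:
    """Cut out the bounding box (left, top, width, height)."""
    return [row[left:left + width] for row in img[top:top + height]]


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Nearest-neighbour resize (an assumed interpolation method)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]


def border_mean(img: Image) -> float:
    """Average intensity of the outermost ring of pixels."""
    h, w = len(img), len(img[0])
    border = [img[r][c] for r in range(h) for c in range(w)
              if r in (0, h - 1) or c in (0, w - 1)]
    return sum(border) / len(border)


def pad_to_square(img: Image, side: int) -> Image:
    """Center the image on a side x side canvas filled with the border mean."""
    h, w = len(img), len(img[0])
    fill = border_mean(img)
    top, left = (side - h) // 2, (side - w) // 2
    out = [[fill] * side for _ in range(side)]
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = img[r][c]
    return out


def preprocess_digit(img: Image, box: Tuple[int, int, int, int],
                     side: int = 28, min_dim: int = 10) -> Optional[Image]:
    """Crop, filter, resize, and pad one digit; None means 'discarded'."""
    digit = crop(img, *box)
    h = len(digit)
    w = len(digit[0]) if digit else 0
    if h < min_dim or w < min_dim:
        return None
    larger = max(h, w)
    new_h = max(1, round(h * side / larger))
    new_w = max(1, round(w * side / larger))
    return pad_to_square(resize_nearest(digit, new_h, new_w), side)


# Synthetic 40 x 30 image whose pixel value equals its column index.
image = [[float(c) for c in range(30)] for _ in range(40)]
digit = preprocess_digit(image, box=(5, 0, 14, 28))
too_small = preprocess_digit(image, box=(0, 0, 9, 20))
```

For color images, the same logic would apply per channel, with the padding color computed as the per-channel border average.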
We used a different model for each dataset, designing separate convolutional neural networks (CNNs) for MNIST and SVHN. Correspondingly, we achieved an accuracy of and on the unsanitized models.
MNIST CNN: Input → (Conv + Pool) × 2 → FC → FC → Output
SVHN CNN: Input → (Conv + Conv + Pool) × 3 → FC → FC → Output
We did canonical sanitization on MNIST by discarding outliers that differ greatly from canonical examples. We chose 340 fonts containing digits as canonical examples. To accommodate variations in handwriting, we also augmented the fonts by scaling and rotation. After the augmentation, we acquired images, from which we randomly chose 80% as the training set and the remaining 20% as the test set. We trained the MNIST CNN on canonical examples and achieved an accuracy of 98.7%. We call this the canonical model.
We fed each example in the MNIST training set to the canonical model. If the confidence score for the example’s correct class was below a threshold, we considered the example an outlier and discarded it. Figure 1 shows examples with low and high confidence. Table 1 shows the number of examples left under different thresholds. We used these examples to train the sanitized models.
We did self sanitization on both MNIST and SVHN. To discard outliers in self sanitization, we trained the CNN for MNIST and SVHN separately, used the models to test every example in the training set, and considered examples whose confidence scores were below a threshold as outliers. Table 2 and Table 3 show the number of examples left under different thresholds. We used these examples to train the sanitized models. Table 3 also shows that the sanitized models maintain high classification accuracy when they have adequate training data to prevent overfitting.
|Threshold||Training set size||Classification accuracy (%)|
2.3 Robustness against adversarial examples
We ran the IGSM and Carlini & Wagner attacks on both the unsanitized and sanitized models.
Figure 2 compares the classification accuracy of the unsanitized and sanitized models on the adversarial examples generated by the IGSM attack on MNIST, where (a) and (b) correspond to canonical sanitization and self sanitization, respectively.
Figure 2 shows that a higher sanitization threshold increases the robustness of the model against adversarial examples while maintaining classification accuracy on normal examples. For example, on adversarial examples generated after five iterations of IGSM, the classification accuracy is 82.8% with a threshold of in canonical sanitization, and is above 92.6% with a threshold of in self sanitization. For normal examples, the classification accuracy is always higher than 95.0% across the different thresholds.
Carlini & Wagner’s attack
We ran Carlini & Wagner’s targeted attack to generate adversarial examples on our sanitized models for both MNIST and SVHN. Figure 3 and Figure 5 show that the sanitized models forced the attack to add larger distortions to the adversarial examples in order to fool them. The higher the threshold, the larger the distortion.
From the experiments, we concluded that a model trained on the original dataset with outliers has higher generalization ability but weaker robustness. With sanitization, the model gained considerably more robustness while sacrificing only limited generalization ability.
2.4 Detecting adversarial examples
We evaluated the effectiveness of using the Kullback-Leibler divergence to detect adversarial examples (Section 1.3).
We generated adversarial examples on two sanitized models on MNIST:
A canonical sanitized model. The discard threshold was set to be .
A self sanitized model. The discard threshold was set to be .
We computed the Kullback-Leibler divergence from the output of the unsanitized model to that of each of the sanitized models.
Figure 6 compares the CDF of Kullback-Leibler divergence between normal and adversarial examples generated by IGSM after different iterations. It shows that the majority of normal examples have very small divergence, while most adversarial examples have large divergence, and that more iterations generate examples with higher divergence. Figure 7 compares the CDF of Kullback-Leibler divergence between normal and adversarial examples generated by the Carlini & Wagner attack using different confidence levels, where the difference is even more prominent.
Table 4 shows the accuracy of detecting adversarial examples based on the KL divergence from the unsanitized model to a canonical sanitized model. We used a threshold of KL divergence to divide normal and adversarial examples: examples below this threshold were considered normal, and those above it adversarial. We determined the threshold by setting a target detection accuracy on normal examples. For example, when we set this target accuracy to 98%, we needed a threshold of KL divergence of . At this threshold, the accuracy of detecting the Carlini & Wagner adversarial examples is above 95% at all the confidence levels. The accuracy of detecting IGSM adversarial examples is high when the number of iterations is high (e.g., 10 or 15). When the number of iterations is low (e.g., 5), the detection accuracy decreases; however, since the false negative examples have low KL divergence, they are more similar to normal examples and therefore can be classified correctly with high probability, as discussed next.
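Choosing the KL divergence threshold from a target accuracy on normal examples, as described above, can be sketched as picking an empirical quantile. This is a minimal sketch with hypothetical function names and made-up divergence values, not the measurements behind Table 4.

```python
import math
from typing import Sequence


def calibrate_threshold(normal_divergences: Sequence[float],
                        target_accuracy: float) -> float:
    """Smallest observed divergence t such that at least `target_accuracy`
    of the normal examples have divergence <= t (i.e., are accepted)."""
    if not normal_divergences:
        raise ValueError("need at least one normal example")
    if not 0.0 < target_accuracy <= 1.0:
        raise ValueError("target_accuracy must be in (0, 1]")
    ordered = sorted(normal_divergences)
    # The small tolerance guards against floating-point round-up in ceil.
    k = max(1, math.ceil(target_accuracy * len(ordered) - 1e-9))
    return ordered[k - 1]


def detection_accuracy(adversarial_divergences: Sequence[float],
                       threshold: float) -> float:
    """Fraction of adversarial examples whose divergence exceeds the threshold."""
    flagged = sum(1 for d in adversarial_divergences if d > threshold)
    return flagged / len(adversarial_divergences)


# Made-up divergences for illustration only.
normal = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.5, 0.9]
adversarial = [0.05, 0.3, 1.2, 2.0]

threshold = calibrate_threshold(normal, target_accuracy=0.8)
accuracy = detection_accuracy(adversarial, threshold)
```

Raising the target accuracy on normal examples raises the threshold, which in turn lowers the fraction of adversarial examples that are flagged.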
|Attack||IGSM (iterations)||Carlini & Wagner (confidence)|
Detection accuracy (%)
To take advantage of both the KL divergence for detecting adversarial examples and the sanitized models for classifying examples, we combined them into a framework shown in Figure 8. The framework consists of a detector, which computes the KL divergence from the unsanitized model to the sanitized model and rejects the example if its divergence exceeds a threshold, and a classifier, which infers the class of the example using the sanitized model. The framework makes a correct decision on an example when:
if the example is regarded as normal, the classifier correctly infers its class.
if the example is adversarial, the detector decides that the example is adversarial or the classifier correctly infers its true class.
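The detector and classifier combination, together with the two correctness conditions above, can be sketched as follows. This is an illustrative sketch: the function names and toy models are hypothetical, and we assume here that a normal example rejected by the detector counts as an incorrect decision.

```python
import math
from typing import Callable, Optional, Sequence

Model = Callable[[Sequence[float]], Sequence[float]]  # input -> class probabilities


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """sum_i p[i] * log(p[i] / q[i]); q[i] is clamped to eps (our assumption)."""
    return sum(p_i * math.log(p_i / max(q_i, eps))
               for p_i, q_i in zip(p, q) if p_i > 0.0)


def classify_or_reject(x: Sequence[float], unsanitized: Model,
                       sanitized: Model, kl_threshold: float) -> Optional[int]:
    """Return None if the detector rejects x, else the sanitized model's class."""
    p, q = unsanitized(x), sanitized(x)
    if kl_divergence(p, q) > kl_threshold:
        return None  # detector: divergence exceeds the threshold
    return max(range(len(q)), key=lambda i: q[i])  # classifier: argmax


def decision_is_correct(decision: Optional[int], true_label: int,
                        is_adversarial: bool) -> bool:
    """Apply the two correctness conditions of the framework."""
    if decision is None:
        return is_adversarial  # rejecting is right only for adversarial inputs
    return decision == true_label


def toy_unsanitized(x: Sequence[float]) -> Sequence[float]:
    """Hypothetical model whose decision boundary sits at 0.5."""
    return [0.1, 0.9] if x[0] > 0.5 else [0.9, 0.1]


def toy_sanitized(x: Sequence[float]) -> Sequence[float]:
    """Hypothetical model whose decision boundary sits at 0.7."""
    return [0.2, 0.8] if x[0] > 0.7 else [0.8, 0.2]


accepted = classify_or_reject([0.9], toy_unsanitized, toy_sanitized, 0.5)
rejected = classify_or_reject([0.6], toy_unsanitized, toy_sanitized, 0.5)
```

In the toy example, the input 0.6 falls between the two decision boundaries, so the models disagree and the detector rejects it.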
Table 5 shows the accuracy of this system on adversarial examples generated by the IGSM attack on a canonical sanitized model on MNIST. At each tested iteration of the IGSM attack, the accuracy of this system on adversarial examples is above 94%. The accuracy of this system on the normal examples is 94.8%.
|IGSM iterations||Accuracy (%)|
Figure 5 compares the CDF of the Kullback-Leibler divergence of normal examples and adversarial examples generated by the Carlini & Wagner attack at different confidence levels. We trained the sanitized model with a discard threshold of 0.9 (self sanitization). Normal examples have small divergence, while all the Carlini & Wagner adversarial examples at the different confidence levels have large divergence.
Table 6 shows the impact of the sanitization threshold on the detection accuracy on adversarial examples generated by the Carlini & Wagner attack. We automatically determined the threshold of KL divergence by setting the detection accuracy on normal examples to 94%. Table 6 shows that as the sanitization threshold increases from 0.7 to 0.9, the detection accuracy increases. However, as the sanitization threshold increases further, the detection accuracy decreases. This is because once the sanitization threshold exceeds 0.9, the size of the training set decreases rapidly, which causes the model to overfit.
|Sanitization threshold||Training set size||KL divergence threshold||Detection accuracy (%) by attack confidence|
3 Discussion and future work
From the observation in Section 2.3 that the sanitized models obtain higher robustness, we speculate that outliers extend the decision boundary and give the model better generalization ability. However, since outliers usually make up a small proportion of the data and are not representative, the extended decision boundary also includes adversarial examples. We call this phenomenon ‘negative generalization’.
The state-of-the-art techniques for handling the negative generalization problem are various advanced adversarial retraining methods, which use adversarial examples as part of the training data and force the model to classify them correctly. These methods essentially increase the proportion of outliers and make the decision boundary more sophisticated to improve robustness.
In this paper, we focus on another direction. By culling the dataset, we restrict the decision boundary and thus limit negative generalization in the sanitized model. This creates a gap in negative generalization (sensitivity to adversarial examples) between the sanitized model and the original model, while the ability of the two models to classify normal examples stays similar. We leverage this gap to detect adversarial examples.
Section 2.4 showed that we can use the Kullback-Leibler divergence as a reliable metric to distinguish between normal and adversarial examples. In future work, we plan to evaluate whether an attacker who knows our detection method can generate adversarial examples that evade it.
4 Related work
Most prior work on machine learning security focused on improving the network architecture, training method, or incorporating adversarial examples in training. By contrast, we focus on culling the training set to remove outliers to improve the model’s robustness.
4.1 Influence of training examples
Influence functions are a technique from robust statistics that measures the dependence of an estimator on the value of one of the points in the sample [5, 18]. Koh et al. used influence functions as an indicator to trace a model’s prediction back to its training data [7]. By modifying the training data and observing the corresponding prediction, influence functions can reveal insights into the model. They found that some ambiguous training examples were effective points that led to a low confidence model. Influence Sketching [19] proposed a new scalable version of Cook’s distance [3, 4] to prioritize samples in the generalized linear model [12]. The predictive accuracy changed slightly from 99.47% to 99.45% when they deleted 10% ambiguous examples from the training set.
4.2 Influence of test examples
Xu et al. [20] observed that most features of test examples are unnecessary for prediction, and that superfluous features facilitate adversarial examples. They proposed two methods to reduce the feature space: reducing the color depth of images and using smoothing to reduce the variation among pixels. Their feature squeezing defense successfully detected adversarial examples while maintaining high accuracy on normal examples.
Adversarial examples remain a challenging problem despite recent progress in defense. In this paper, we studied the relationship between outliers in the data set and model robustness and proposed a framework for detecting adversarial examples without modifying the original model architecture. We designed two methods to detect and remove outliers in the training set and used the remaining examples to train a sanitized model. On both MNIST and SVHN, the sanitized models improved the classification accuracy on adversarial examples generated by the IGSM attack and increased the distortion of adversarial examples generated by the Carlini & Wagner attack, which indicates that the sanitized model is less sensitive to adversarial examples. Our detection essentially leverages the difference in sensitivity to adversarial examples between models trained with and without outliers. We found that the Kullback-Leibler divergence from the unsanitized model to the sanitized model can be used to measure this difference and detect adversarial examples reliably.
This material is based upon work supported by the National Science Foundation under Grant No. 1801751.
- [1] Carlini, N., Wagner, D.: Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint arXiv:1705.07263 (2017)
- [2] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: IEEE Symposium on Security and Privacy (2017)
- [3] Cook, R.D.: Detection of influential observation in linear regression. Technometrics 19(1), 15–18 (1977)
- [4] Cook, R.D.: Influential observations in linear regression. Journal of the American Statistical Association 74(365), 169–174 (1979)
- [5] Cook, R.D., Weisberg, S.: Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4), 495–508 (1980)
- [6] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR) (2015)
- [7] Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: International Conference on Machine Learning (2017)
- [8] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
- [9] Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1), 79–86 (1951)
- [10] Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. CoRR abs/1607.02533 (2016)
- [11] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
- [12] Madsen, H., Thyregod, P.: Introduction to general and generalized linear models. CRC Press (2010)
- [13] Meng, D., Chen, H.: MagNet: a two-pronged defense against adversarial examples. In: ACM Conference on Computer and Communications Security (CCS). Dallas, TX (2017)
- [14] Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. In: International Conference on Learning Representations (ICLR) (2017)
- [15] Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2574–2582 (2016)
- [16] Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE Symposium on Security and Privacy (2016)
- [17] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (ICLR) (2014)
- [18] Weisberg, S., Cook, R.D.: Residuals and influence in regression (1982)
- [19] Wojnowicz, M., Cruz, B., Zhao, X., Wallace, B., Wolff, M., Luan, J., Crable, C.: “Influence sketching”: Finding influential samples in large-scale regressions. In: Big Data (Big Data), 2016 IEEE International Conference on. pp. 3601–3612. IEEE (2016)
- [20] Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in deep neural networks. In: Network and Distributed Systems Security Symposium (NDSS) (2018)