1 Introduction
Deep neural networks have been used with great success for perceptual tasks such as image classification
(Simonyan & Zisserman, 2014; LeCun et al., 2015) or speech recognition (Hinton et al., 2012). While they are known to be robust to random noise, it has been shown that the accuracy of deep nets can dramatically deteriorate in the face of so-called adversarial examples (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014), i.e. small perturbations of the input signal, often imperceptible to humans, that are sufficient to induce large changes in the model output. A plethora of methods have been proposed to find adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Sabour et al., 2015). These often transfer across different architectures, enabling black-box attacks even against inaccessible models (Papernot et al., 2016; Kilcher & Hofmann, 2017; Tramèr et al., 2017). This apparent vulnerability is worrisome as deep nets start to proliferate in the real world, including in safety-critical deployments.
The most direct and popular strategy of robustification is to use adversarial examples as data augmentation during training (Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017). This improves robustness against specific attacks, yet does not address vulnerability to more cleverly designed counterattacks (Athalye et al., 2018; Carlini & Wagner, 2017a). This raises the question of whether one can protect models with regard to a wider range of possible adversarial perturbations.
A different strategy of defense is to detect whether or not the input has been perturbed, by detecting characteristic regularities either in the adversarial perturbations themselves or in the network activations they induce (Grosse et al., 2017; Feinman et al., 2017; Xu et al., 2017; Metzen et al., 2017; Carlini & Wagner, 2017a). In this spirit, we propose a method that measures how feature representations and log-odds change under noise: if the input is adversarially perturbed, the noise-induced feature variation tends to have a characteristic direction, whereas it tends not to have any specific direction if the input is natural. We evaluate our method against strong iterative attacks and show that even an adversary aware of the defense cannot evade our detector.
In summary, we make the following contributions:

We propose a statistical test for the detection and classification of adversarial examples.

We establish a link between adversarial perturbations and inverse problems, providing valuable insights into the feature space kinematics of adversarial attacks.

We conduct extensive performance evaluations as well as a range of experiments to shed light on aspects of adversarial perturbations that make them detectable.
2 Related Work
Iterative adversarial attacks. Adversarial perturbations are small, specifically crafted perturbations of the input, typically imperceptible to humans, that are sufficient to induce large changes in the model output. Let $F(x) = \arg\max_y f_y(x)$ be a probabilistic classifier with logits $f_y$ and let $y$ denote the true label of input $x$. The goal of the adversary is to find an $\ell_p$-norm bounded perturbation $\Delta x$, $\|\Delta x\|_p \le \epsilon$, where $\epsilon$ controls the attack strength, such that the perturbed sample $x^* = x + \Delta x$ gets misclassified by the classifier $F$. Two of the most iconic iterative adversarial attacks are:

Projected Gradient Descent (Madry et al., 2017), aka Basic Iterative Method (Kurakin et al., 2016):

$x^{0} = x$
$x^{k+1} = \Pi_{\mathcal S}\big[x^{k} - \alpha\,\mathrm{sign}\big(\nabla_x \mathcal L(x^{k}, y^*)\big)\big]$
$x^{k+1} = \Pi_{\mathcal S}\big[x^{k} - \alpha\,\nabla_x \mathcal L(x^{k}, y^*) / \|\nabla_x \mathcal L(x^{k}, y^*)\|_2\big]$   (1)

where the second and third line refer to the $\ell_\infty$- and $\ell_2$-norm variants respectively, $\Pi_{\mathcal S}$ is the projection operator onto the set $\mathcal S$, $\alpha$ is a small step size, $y^*$ is the target label and $\mathcal L$ is a suitable loss function. For untargeted attacks $y^* = y$ and the sign in front of $\alpha$ is flipped, so as to ascend the loss function.

Carlini-Wagner attack (Carlini & Wagner, 2017b):

$\min_{\Delta x}\; \|\Delta x\|_p + c \cdot h(x + \Delta x) \quad \text{s.t.} \quad x + \Delta x \in \mathcal D$   (2)

where $h$ is an objective function, defined such that $h(x + \Delta x) \le 0$ if and only if $F(x + \Delta x) = y^*$, e.g. $h(x') = \max\big(\max_{z \ne y^*} f_z(x') - f_{y^*}(x'),\, 0\big)$ (see Section V.A in (Carlini & Wagner, 2017b) for a list of objective functions with this property) and $\mathcal D$ denotes the data domain, e.g. $\mathcal D = [0,1]^d$. The constant $c$ trades off perturbation magnitude (proximity) with perturbation strength (attack success rate) and is chosen via binary search.
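To make the iteration in Equation 1 concrete, here is a minimal numpy sketch of the untargeted $\ell_\infty$ variant. The toy linear softmax classifier, its weight matrix `W` and all hyperparameter values are our own illustrative stand-ins, not the models or settings used in this paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def xent_grad(W, x, y):
    """Gradient of the softmax cross-entropy loss w.r.t. the input x
    for a toy linear classifier with logits W @ x."""
    p = softmax(W @ x)
    p[y] -= 1.0                      # dL/dlogits = p - e_y
    return W.T @ p                   # chain rule back to the input

def pgd_linf(W, x, y, eps=0.25, alpha=0.05, steps=20):
    """Untargeted l_inf PGD: ascend the loss, project back onto the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(xent_grad(W, x_adv, y))  # ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)                # projection
    return x_adv
```

On random data this increases the loss of the true class while keeping $\|\Delta x\|_\infty \le \epsilon$; swapping the sign of the step and the label recovers the targeted variant.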
Detection.
The approaches most related to our work are those that defend a machine learning model against adversarial attacks by detecting whether or not the input has been perturbed, either by detecting characteristic regularities in the adversarial perturbations themselves or in the network activations they induce (Grosse et al., 2017; Feinman et al., 2017; Xu et al., 2017; Metzen et al., 2017; Song et al., 2017; Li & Li, 2017; Lu et al., 2017; Carlini & Wagner, 2017a). Notably, Grosse et al. (2017) argue that adversarial examples are not drawn from the same distribution as the natural data and can thus be detected using statistical tests. Metzen et al. (2017) propose to augment the deep classifier net with a binary "detector" subnetwork that takes input from intermediate feature representations and is trained to discriminate between natural and adversarial network activations. Feinman et al. (2017) suggest detecting adversarial examples either via kernel density estimates in the feature space of the last hidden layer or via dropout uncertainty estimates of the classifier's predictions, which are meant to detect whether inputs lie in low-confidence regions of the ambient space. Xu et al. (2017) propose to detect adversarial examples by comparing the model's predictions on a given input with its predictions on a squeezed version of the input: if the difference between the two exceeds a certain threshold, the input is considered adversarial.

Adversarial Examples.
It is still an open question whether adversarial examples exist because of intrinsic flaws of the model or learning objective, or whether they are solely a consequence of non-zero generalization error and high-dimensional statistics (Gilmer et al., 2018; Schmidt et al., 2018; Fawzi et al., 2018), with adversarially robust generalization simply requiring more data than classical generalization (Schmidt et al., 2018). We note that our method works regardless of the origin of adversarial examples. Even if adversarial examples are not the result of intrinsic flaws, they still induce characteristic regularities in the feature representations of a neural net, e.g. under noise, and can thus be detected.
3 Identifying and Correcting Manipulations
3.1 Perturbed Log-Odds
We work in a multi-class setting, where pairs of inputs and class labels $(x, y)$ are generated from a data distribution $p$. The input may be subjected to an adversarial perturbation $\Delta x$ such that $F(x + \Delta x) \ne y$, forcing a misclassification. A well-known defense strategy against such manipulations is to voluntarily corrupt inputs by noise before processing them. The rationale is that by adding noise $\eta$, one may be able to recover the original class, if the noise amplitude is sufficiently large. For this to succeed, one typically utilizes domain knowledge in order to construct meaningful families of random transformations, as has been demonstrated, for instance, in (Xie et al., 2017; Athalye & Sutskever, 2017). Unstructured (e.g. white) noise, on the other hand, typically does not yield practically viable trade-offs between probability of recovery and overall accuracy loss.
We thus propose to look for more subtle statistics that can be uncovered by using noise as a probing instrument and not as a direct means of recovery. We will focus on probabilistic classifiers with a logit layer of scores, as this gives us access to continuous values. For concreteness we will explicitly parameterize logits via $f_y(x) = \langle w_y, \phi(x)\rangle$ with class-specific weight vectors $w_y$ on top of a feature map $\phi$ realized by a (trained) deep network. Note that typically the feature dimension greatly exceeds the number of classes. We also define pairwise log-odds between classes $y$ and $z$, given input $x$:

$f_{y,z}(x) = f_z(x) - f_y(x) = \langle w_z - w_y, \phi(x)\rangle$   (3)

We are interested in the noise-perturbed log-odds $f_{y,z}(x + \eta)$ with $\eta \sim \mathcal N$, where $y$ is the true label, if ground truth is available, e.g. during training, or $y = F(x)$, during testing.

Note that the log-odds may behave differently for different class pairs, as they reflect class confusion probabilities that are task-specific and that cannot be anticipated a priori. This can be addressed by performing a Z-score standardization across data points $x$ and perturbations $\eta$. For each fixed class pair $(y, z)$ define:

$\bar g_{y,z}(x) = \dfrac{g_{y,z}(x) - \mu_{y,z}}{\sigma_{y,z}}, \quad g_{y,z}(x) := \mathbb E_\eta\big[f_{y,z}(x+\eta) - f_{y,z}(x)\big], \quad \mu_{y,z} := \mathbb E_x \mathbb E_\eta\big[f_{y,z}(x+\eta) - f_{y,z}(x)\big], \quad \sigma^2_{y,z} := \mathrm{Var}_{x,\eta}\big[f_{y,z}(x+\eta) - f_{y,z}(x)\big]$   (4)

In practice, all of the above expectations are computed by sample averages over training data and noise instantiations.
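The sample-average version of the standardized statistics in Equation 4 can be sketched as follows. The feature map `phi`, weight matrix `W`, noise level and sample counts here are our own illustrative choices:

```python
import numpy as np

def log_odds(W, phi_x, y, z):
    """Pairwise log-odds f_{y,z} = <w_z - w_y, phi(x)>."""
    return (W[z] - W[y]) @ phi_x

def noise_perturbed_g(phi, W, x, y, z, sigma, n_noise, rng):
    """g_{y,z}(x): average noise-induced change of the log-odds."""
    base = log_odds(W, phi(x), y, z)
    etas = sigma * rng.normal(size=(n_noise,) + x.shape)
    return np.mean([log_odds(W, phi(x + e), y, z) - base for e in etas])

def z_scores(phi, W, xs, y, z, sigma=0.1, n_noise=64, rng=None):
    """Z-score standardization of g_{y,z} across the data points xs."""
    rng = rng or np.random.default_rng(0)
    g = np.array([noise_perturbed_g(phi, W, x, y, z, sigma, n_noise, rng)
                  for x in xs])
    return (g - g.mean()) / (g.std() + 1e-12)
```

By construction the returned scores have zero mean and unit variance over the reference data, so a single threshold scale becomes comparable across class pairs.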
3.2 Log-Odds Robustness
The main idea pursued in this paper is that the robustness properties of the perturbed log-odds statistics differ depending on whether $x$ is naturally generated or whether it is obtained through an (unobserved) adversarial manipulation.

Firstly, note that it is indeed very common to use (small-amplitude) noise during training as a way to robustify models, or to use regularization techniques which improve model generalization. In our notation this means that for natural $(x, y) \sim p$, it is a general design goal – prior to even considering adversarial examples – that with high probability the log-odds with regard to the true class remain stable under noise. We generally may expect $f_{y,z}(x)$ to be negative (favoring the correct class) and slightly increasing under noise, as the classifier may become less certain.

Secondly, we posit that for many existing deep learning architectures, common adversarial attacks find perturbations that are not robust, but that overfit to specifics of $x$. We elaborate on this conjecture below by providing empirical evidence and theoretical insights. For the time being, note that if this conjecture can be reasonably assumed, then this opens up ways to design statistical tests to identify adversarial examples and possibly even to infer the true class label, which is particularly useful for test-time attacks.

Consider the case of a test-time attack, where we suspect an unknown perturbation has been applied such that $x = x_{\text{nat}} + \Delta x$ gets misclassified. If the perturbation is not robust w.r.t. the noise process, then noise will partially undo the effect of the adversarial manipulation and directionally revert the log-odds towards the true class, in a way that is statistically captured in the perturbed log-odds. Figure 1 (lower left corner) shows this reversion effect.
3.3 Statistical Test
We propose to use the expected perturbed log-odds $\bar g_{y,z}(x)$ as statistics to test whether $x$, classified as $y = F(x)$, should be thought of as a manipulated example of (true) class $z$ or not. To that extent, we define thresholds $\tau_{y,z}$, which guarantee a maximal false detection rate (of, say, 1%), yet maximize the true positive rate of identifying adversarial examples. We then flag an example as (possibly) manipulated if

$\max_{z \ne y} \big\{\, \bar g_{y,z}(x) - \tau_{y,z} \,\big\} \ge 0,$   (5)

otherwise it is considered clean.
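The decision rule in Equation 5 then reduces to a few lines; in practice the thresholds would be calibrated on held-out clean data for a target false positive rate, while the values in the usage example below are purely illustrative:

```python
import numpy as np

def is_adversarial(g_bar, tau, y):
    """Flag x (predicted as class y) if any standardized log-odds score
    exceeds its class-pair threshold: max_{z != y} g_bar[z] - tau[y, z] >= 0."""
    scores = np.delete(g_bar - tau[y], y)   # exclude the z == y entry
    return bool(np.max(scores) >= 0.0)
```

For example, with uniform thresholds `tau = np.full((3, 3), 2.0)`, a score vector `[0.0, 2.5, -1.0]` for a sample predicted as class 0 is flagged, while `[0.0, 1.5, -1.0]` is considered clean.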
3.4 Corrected Classification
For test-time attacks, it may be relevant not only to detect manipulations, but also to correct them on the spot. The simplest approach is to define a new classifier $G$ via

$G(x) = \arg\max_z \big\{\, \bar g_{y,z}(x) - \tau_{y,z} \,\big\}, \qquad y = F(x).$   (6)

Here we have set $\bar g_{y,y}(x) = \tau_{y,y} = 0$, which sets the correct reference point consistent with Equation 5.
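A sketch of this simple corrected classifier, with the reference point for the originally predicted class set to zero explicitly (function and variable names are our own):

```python
import numpy as np

def corrected_class(g_bar, tau, y):
    """Equation 6: argmax_z of g_bar[z] - tau[y, z], with the predicted
    class y itself entering at the reference score 0."""
    scores = g_bar - tau[y]
    scores[y] = 0.0          # reference point: g_bar[y,y] = tau[y,y] = 0
    return int(np.argmax(scores))
```

If no score clears its threshold, the argmax falls back to the original prediction y, so the corrected classifier only overrides flagged samples.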
A somewhat more sophisticated approach is to build a second-level classifier on top of the perturbed log-odds statistics. We have performed experiments with training a logistic regression classifier for each class on top of the standardized log-odds scores $\bar g_{y,z}(x)$. We have found this to further improve classification accuracy, especially in cases where several Z-scores are comparably far above the threshold. See Section 7.1 in the Appendix for further details.

4 Feature Space Analysis
4.1 Optimal Feature Space Manipulation
The feature space view allows us to characterize the optimal direction of manipulation for an attack targeting some class $z$. Obviously the log-odds $f_{y,z}$ only depend on a single direction in feature space, namely $\Delta w := w_z - w_y$.
Proposition 1.
For constraint sets in feature space that are closed under orthogonal projections, the optimal attack takes the form $\Delta\phi^* = \beta\,\Delta w$ for some $\beta > 0$.
Proof.
Assume $\Delta\phi$ is optimal. We can decompose $\Delta\phi = \beta\,\Delta w + \Delta\phi^{\perp}$, where $\langle \Delta\phi^{\perp}, \Delta w\rangle = 0$. The projection $\beta\,\Delta w$ achieves the same change in log-odds as $\Delta\phi$ and is also optimal. ∎
Proposition 2.
If $\Delta\phi = \beta\,\Delta w + \Delta\phi^{\perp}$ s.t. $\langle \Delta\phi^{\perp}, \Delta w\rangle = 0$ and $\beta > 0$, then the log-odds $f_{y,z}$ change by $\beta\,\|\Delta w\|^2$.
Proof.
Follows directly from linearity of the log-odds. ∎
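Both propositions rest on the fact that the log-odds only respond to the $\Delta w$-component of a feature perturbation. A quick numerical check, with random vectors of our own choosing standing in for the weights and features:

```python
import numpy as np

rng = np.random.default_rng(0)
w_y, w_z = rng.normal(size=4), rng.normal(size=4)
dw = w_z - w_y                               # the only direction that matters
phi = rng.normal(size=4)

# decompose an arbitrary feature perturbation into parallel/orthogonal parts
dphi = rng.normal(size=4)
beta = (dphi @ dw) / (dw @ dw)
dphi_perp = dphi - beta * dw                 # orthogonal component

log_odds = lambda f: dw @ f
change_full = log_odds(phi + dphi) - log_odds(phi)
change_parallel = log_odds(phi + beta * dw) - log_odds(phi)

# the orthogonal component contributes nothing (Proposition 1),
# and the parallel component contributes beta * ||dw||^2 (Proposition 2)
assert np.isclose(change_full, change_parallel)
assert np.isclose(change_parallel, beta * (dw @ dw))
```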
Now, as we treat the deep neural network defining $\phi$ as a black-box device, it is difficult to state whether a (near-)optimal feature space attack can be carried out by manipulating $\phi$ in input space via $\Delta x$. However, we will use some DNN phenomenology as a starting point for making reasonable assumptions that can advance our understanding.
4.2 Pre-Image Problems
The feature space view suggests searching for a pre-image of the optimal manipulation, or at least a manipulation $\Delta x$ such that $\|\Delta\phi^* - (\phi(x + \Delta x) - \phi(x))\|$ is small. Such pre-image problems are well studied in the field of robotics as inverse kinematics problems. A naïve approach would be to linearize $\phi$ at $x$ and use the Jacobian,

$\phi(x + \Delta x) \approx \phi(x) + J(x)\,\Delta x, \qquad J(x) = \dfrac{\partial \phi}{\partial x}\Big|_{x}.$   (7)

Iterative improvements could then be obtained by inverting (or pseudo-inverting) $J$, but these are known to be plagued by instabilities. A popular alternative is the so-called Jacobian transpose method (Buss, 2004; Wolovich & Elliott, 1984; Balestrino et al., 1984). This can be motivated by a simple observation:
Proposition 3.
Given an input $x$ as well as a target direction $\Delta\phi^*$ in feature space, define $\Delta x = J(x)^\top \Delta\phi^*$ and assume that $J(x)\,\Delta x \ne 0$. Then there exists a (small enough) step size $\eta > 0$ such that $x + \eta\,\Delta x$ is a better pre-image, in that $\|\Delta\phi^* - (\phi(x + \eta\,\Delta x) - \phi(x))\| < \|\Delta\phi^*\|$.
Proof.
Follows from a Taylor expansion of $\phi$ around $x$. ∎
It turns out that by the chain rule, we get for any loss $\mathcal L$ defined in terms of features $\phi$,

$\nabla_x \mathcal L = J(x)^\top \nabla_\phi \mathcal L.$   (8)

With the softmax loss and in case of a prediction concentrated on the source class $y$, i.e. $p(\cdot \mid x) \approx e_y$, one gets for target class $z$

$\nabla_x \mathcal L(x, z) \approx J(x)^\top (w_y - w_z) = -J(x)^\top \Delta w.$   (9)

This shows that a gradient-based iterative attack is closely related to solving the pre-image problem for finding an optimal feature perturbation via the Jacobian transpose method.
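The chain-rule identity can be checked numerically on a small one-layer tanh feature map of our own choosing (not one of the paper's networks): the input gradient of the softmax loss equals $J(x)^\top$ applied to the feature-space gradient $\sum_c p_c w_c - w_z$, which tends to $w_y - w_z$ as the prediction concentrates on the source class:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))      # toy feature map phi(x) = tanh(A x)
W = rng.normal(size=(3, 6))      # class weight vectors w_y as rows

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

x, z = rng.normal(size=4), 2     # input and attack target class

phi = np.tanh(A @ x)
J = (1 - phi**2)[:, None] * A    # Jacobian d phi / d x of the tanh layer

# feature-space gradient of the softmax loss for target z
p = softmax(W @ phi)
grad_phi = W.T @ p - W[z]        # = sum_c p_c w_c - w_z

# Equation 8: input gradient via the Jacobian transpose
grad_x = J.T @ grad_phi

# central finite-difference check of grad_x
def loss(v):
    return -np.log(softmax(W @ np.tanh(A @ v))[z])

eps = 1e-6
fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])
assert np.allclose(grad_x, fd, atol=1e-5)
```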
4.3 Approximate Rays and Adversarial Cones
If an adversary could directly control the feature space representation, optimal attack vectors could always be found along the ray $\phi(x) + t\,\Delta w$. As the adversary has to work in input space, this may only be possible approximately: optimal manipulations may not lie on an exact ray and may not be perfectly collinear with $\Delta w$. However, experimentally, we have found that an optimal perturbation $\Delta x$ typically defines a ray in input space, $x + t\,\Delta x$ ($t \ge 0$), yielding a feature-space trajectory for which the rate of change along $\Delta w$ is nearly constant over a relevant range of $t$ (see Figures 3 & 9). As tangents are given by

$\dfrac{d}{dt}\,\phi(x + t\,\Delta x) = J(x + t\,\Delta x)\,\Delta x,$   (10)

this means that $\langle J(x + t\,\Delta x)\,\Delta x,\, \Delta w\rangle \approx \text{const}$. Although the trajectory may fluctuate along feature space directions orthogonal to $\Delta w$, making it not a perfect ray, the key characteristic is that there is steady progress in changing the relevant log-odds. While it is obvious that the existence of such rays plays into the hands of an adversary, it remains an open theoretical question to elucidate the properties of the model architecture causing such vulnerabilities.

As adversarial directions are expected to be susceptible to angular variations (otherwise they would be simple to find, pointing at a general lack of model robustness), we conjecture that geometrically optimal adversarial manipulations are embedded in a cone-like structure, which we call an adversarial cone. Experimental evidence for the existence of such cones is visualized in Figure 5. It is a virtue of the commutativity of applying the adversarial perturbation and random noise that our statistical test can reliably detect such adversarial cones.
5 Experimental Results
Table 1. Test set accuracy (clean / PGD).

Dataset   | Model        | clean / PGD
CIFAR10   | CNN7         | 93.8% / 3.91%
CIFAR10   | WResNet      | 96.2% / 2.60%
CIFAR10   | CNN4         | 73.5% / 14.5%
ImageNet  | Inception V3 | 76.5% / 7.2%
ImageNet  | ResNet 101   | 77.0% / 7.2%
ImageNet  | ResNet 18    | 69.2% / 6.5%
ImageNet  | VGG11(+BN)   | 70.1% / 5.7%
ImageNet  | VGG16(+BN)   | 73.3% / 6.1%
5.1 Datasets, Architectures & Training Methods
In this section, we provide experimental support for our theoretical propositions and we benchmark our detection and correction methods on various deep neural network architectures trained on the CIFAR10 and ImageNet datasets. For CIFAR10, we compare a WideResNet implementation from Madry et al. (2017), a 7-layer CNN with batch normalization and a vanilla 4-layer CNN; details can be found in the Appendix. In the following, if nothing else is specified, we use the 7-layer CNN as the default platform, since it has good test set accuracy at relatively low computational requirements. For ImageNet, we use a selection of models from the torchvision package (Marcel & Rodriguez, 2010), including Inception V3, ResNet101 and VGG16.

As the default attack strategy we use an $\ell_\infty$-norm constrained PGD white-box attack. The attack budget $\epsilon$ was chosen to be the smallest value such that most examples are successfully attacked. We experimented with a number of different PGD iterations and found that the corrected classification accuracy is nearly constant across the entire range from 10 up to 1000 attack iterations. The result of this experiment can be found in Figure 10 in the Appendix. For the remainder of this paper, we thus fixed the number of iterations to 20. Table 1 shows test set accuracies for all models on both clean and adversarial samples.
We note that the detection algorithm (based on Equation 5) is completely attack agnostic, while the logistic classifier based correction algorithm is trained on adversarially perturbed training samples, see Section 7.1 in the Appendix for further details. The secondlevel logistic classifier is the only stage where we explicitly include an adversarial attack model. While this could in principle lead to overfitting to the particular attack considered, we empirically show that the correction algorithm performs well under attacks not seen during training, see Section 5.6, as well as specifically designed counterattacks, see Section 5.7.

5.2 Detectability of Adversarial Examples
Before we evaluate our method, we present empirical support for the claims made in Sections 3 and 4.
Induced feature space perturbations. We compute (i) the norm of the induced feature space perturbation along adversarial and random directions (the expected norm of the noise is set to be approximately equal to the expected norm of the adversarial perturbation). We also compute (ii) the alignment between the induced feature space perturbation and certain weight-difference vectors. For the adversarial direction, we compute the alignment with the weight-difference vector between the true and adversarial class. For the random direction, the largest alignment with any weight-difference vector is computed.

The results are reported in Figure 3. The plot on the left shows that iterative adversarial attacks induce feature space perturbations that are significantly larger than those induced by random noise. Similarly, the plot on the right shows that the alignment of the attack-induced feature space perturbation is significantly larger than the alignment of the noise-induced feature space perturbation. Combined, this indicates that adversarial examples lie in particular directions in input space in which small perturbations cause atypically large feature space perturbations along the weight-difference direction $\Delta w$.
Distance to decision boundary. Next, we investigate whether adversarial examples are closer or farther from the decision boundary compared to their unperturbed counterpart. The purpose is to test whether adversarial examples could be detectable for the trivial reason that they are lying closer to the decision boundary than natural examples.
To this end, we measure (i) the logit crossover when linearly interpolating between an adversarially perturbed example and its natural counterpart, i.e. the interpolation coefficient at which the logits of the natural and adversarial class cross. We also measure (ii) the average norm of the DeepFool perturbation required to cross the nearest decision boundary, for a given interpolant (the DeepFool attack tries to find the shortest path to the nearest decision boundary; we additionally augment DeepFool with a binary search to hit the decision boundary precisely). With the second experiment we want to measure whether the adversarial example is closer to any decision boundary, not necessarily the one between the natural and adversarial example in part (i).
[Figure 4: distance to the nearest decision boundary, plotted against the relative offset to the logit crossover point]
Our results confirm that adversarial examples are not closer to the decision boundary than their natural counterparts. The mean logit crossover is at . Similarly, as shown in Figure 4, the mean distance to the nearest decision boundary is for adversarial examples, compared to for natural examples. Hence, adversarial examples are even slightly farther from the decision boundary.

We can thus rule out the possibility that adversarial examples are detectable merely because of a discrepancy in distance to the decision boundary.
Proximity to nearest neighbor. We measure the ratio of the distance between the adversarial example and the corresponding unperturbed example to the distance between the adversarial example and its nearest other neighbor (in either the training or test set), over a number of samples in the test set, for various $\ell_\infty$- and $\ell_2$-bounded PGD attacks.

We consistently find that this ratio is sharply peaked around a value much smaller than one. E.g. for the $\ell_\infty$ PGD attack we get , while for the corresponding $\ell_2$ PGD attack we obtain . Further values can be found in Table 7 in the Appendix. We note that similar findings have been reported before (Tramèr et al., 2017). Hence, "perceptually similar" adversarial samples are much closer to the unperturbed sample than to any other neighbor in the training or test set.

We would therefore naturally expect adversarial examples to be shifted towards the unperturbed sample, rather than towards any other neighbor, when convolved with random noise. Although adding noise is generally not sufficient to cross the decision boundary, e.g. to restore the original class, the noise-induced feature variation is more likely to shift towards the original class than towards any other neighboring class.
Adversarial Cones. To visualize the ambient space neighborhood around natural and adversarially perturbed samples, we plot the averaged 2D projection of the classifier's prediction for the natural class in a hyperplane spanned by the adversarial perturbation and randomly sampled orthogonal vectors: the (normalized) adversarial direction varies along the horizontal axis, random orthogonal directions of approximately equal norm along the vertical axis, and the plotted value is the expected softmax probability of the natural class over these random directions.

Interestingly, the plot reveals that adversarial examples live in a conic neighborhood: the adversarial sample is, statistically speaking, "surrounded" by the natural class, as can be seen from the gray rays confining the adversarial cone. This confirms our proximity results and illustrates why the noise-induced feature variation tends to have a direction that is indicative of the natural class when the sample is adversarially perturbed. See also Figure 9 in the Appendix.
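The 2D projection underlying such a plot can be sketched as follows, with a toy classifier as a stand-in: each random direction is orthogonalized against the adversarial perturbation via Gram-Schmidt, and the model's probability for the natural class is evaluated on the plane spanned by the two (all names are our own):

```python
import numpy as np

def projection_grid(predict_nat, x, delta, alphas, betas, n_dirs=8, rng=None):
    """Average p(natural class | x + a*u + b*v) over random directions v
    orthogonal to the unit adversarial direction u = delta / ||delta||."""
    rng = rng or np.random.default_rng(0)
    u = delta / np.linalg.norm(delta)
    grid = np.zeros((len(betas), len(alphas)))
    for _ in range(n_dirs):
        v = rng.normal(size=x.shape)
        v -= (v @ u) * u                      # Gram-Schmidt: v orthogonal to u
        v /= np.linalg.norm(v)                # rescale to unit norm
        for i, b in enumerate(betas):
            for j, a in enumerate(alphas):
                grid[i, j] += predict_nat(x + a * u + b * v)
    return grid / n_dirs
```

Rendering `grid` as a heat map with the adversarial direction on the horizontal axis reproduces the qualitative cone picture described above.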
Suboptimality & robustness to random noise. By virtue of the commutativity of applying the adversarial perturbation and random noise, the view that the adversarial perturbation is not robust to random noise is dual to the view that the noisy perturbation is suboptimal for the natural sample. To investigate this, we compute (i) the softmax predictions when adding noise to the adversarial example and (ii) the noise-induced weight-difference alignment for natural and adversarial examples.

The results are reported in Figure 6. The plot on the left shows that the probability of the natural class increases significantly faster than the probability of the highest other class when noise with a small to intermediate magnitude is added to the adversarial example. Note, however, that the probability of the natural class never climbs to be the highest among all classes, which is why simple addition of noise to an adversarial example does not recover the natural class in general. The plot on the right shows that the noise-induced weight-difference alignment is significantly larger for the adversarial example than for the natural example. This illustrates that the noise manages to partially undo the effect of the adversarial manipulation and directionally revert the features towards the true class. Combined, these results provide a direct justification for detecting adversarial examples via the test statistic presented in Section 3.3.

5.3 Detection rates and classification accuracies
In the remainder of this section we present the results of various performance evaluations. The reported detection rates measure how often our method classifies a sample as being adversarial (corresponding to the False Positive Rate if the sample is clean and to the True Positive Rate if it was perturbed). We also report accuracies for the predictions made by the logistic classifier based correction method.
Tables 2 and 3 report the detection rates of our statistical test and the accuracies of the corrected predictions. Our method manages to detect nearly all adversarial samples, seemingly getting better as models become more complex, all the while the false positive rate stays low. Further², after correcting, we manage to successfully reclassify almost all of the detected adversarial samples to their respective source class, resulting in test set accuracies on adversarial samples within a few percent of the respective test set accuracies on clean samples. Also note that due to the low false positive rate, the drop in performance on clean samples is negligible.

² Due to computational constraints, we focus on the CIFAR10 models in the remainder of this paper.
Table 2. Detection rate (clean / PGD).

Dataset   | Model        | clean / PGD
CIFAR10   | CNN7         | 0.8% / 95.0%
CIFAR10   | WResNet      | 0.2% / 99.1%
CIFAR10   | CNN4         | 1.4% / 93.8%
ImageNet  | Inception V3 | 1.9% / 99.6%
ImageNet  | ResNet 101   | 0.8% / 99.8%
ImageNet  | ResNet 18    | 0.6% / 99.8%
ImageNet  | VGG11(+BN)   | 0.5% / 99.9%
ImageNet  | VGG16(+BN)   | 0.3% / 99.9%
Table 3. Corrected classification accuracy (clean / PGD).

Dataset  | Model   | clean / PGD
CIFAR10  | CNN7    | 93.6% / 89.5%
CIFAR10  | WResNet | 96.0% / 92.7%
CIFAR10  | CNN4    | 71.0% / 67.6%
5.4 Effective Strength of Adversarial Perturbations
We measure how the detection and reclassification accuracy of our method depends on the attack strength. To this end, we define the effective Bernoulli-$q$ strength of $\epsilon$-bounded adversarial perturbations as the attack success rate when each entry of the perturbation is individually accepted with probability $q$ and set to zero with probability $1-q$. For $q = 1$ we obtain the usual adversarial misclassification rate of the classifier. We naturally expect weaker attacks to be less effective but also harder to detect than stronger attacks.
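The Bernoulli-$q$ weakening of a perturbation can be sketched in a few lines (function name is our own):

```python
import numpy as np

def bernoulli_weaken(delta, q, rng=None):
    """Keep each entry of the perturbation independently with probability q,
    zero it out with probability 1 - q; q = 1 recovers the full attack."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(delta.shape) < q
    return delta * mask
```

Applying the weakened perturbation `x + bernoulli_weaken(delta, q)` and measuring the misclassification rate over many mask draws yields the effective strength curve as a function of q.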
The results are reported in Figure 7. We can see that the uncorrected accuracy of the classifier decreases monotonically as the attack strength increases, both in terms of the attack budget $\epsilon$ and in terms of the fraction $q$ of accepted perturbation entries. Meanwhile, the detection rate of our method increases at such a rate that the corrected classifier manages to compensate for the decay in uncorrected accuracy, across the entire range considered.
[Figure 7: detection rate and accuracy as a function of the attack budget and the Bernoulli attack strength q]
5.5 Comparing to Adversarial Training
For comparison, we also report test set and whitebox attack accuracies for adversarially trained models. Madry et al. (2017)’s WResNet was available as an adversarially pretrained variant, while the other models were adversarially trained as outlined in the Appendix 7.1. The results for the best performing classifier are shown in Table 4. We can see that adversarial training does not compare favorably to our method, as the accuracy on adversarial samples is significantly lower while the drop in performance on clean samples is considerably larger.
Table 4. Adversarially trained models: accuracy (clean / PGD).

Model   | clean / PGD
CNN7    | 82.2% / 44.4%
WResNet | 87.3% / 55.2%
CNN4    | 68.2% / 40.4%
5.6 Defending against Unseen Attacks
Next, we evaluate our method on adversarial examples created by attacks that differ from the $\ell_\infty$-constrained PGD attack used to train the second-level logistic classifier. The rationale is that the log-odds statistics of the unseen attacks could be different from the ones used to train the logistic classifier. We thus want to test whether it is possible to evade correct reclassification by switching to a different attack. As alternative attacks we use an $\ell_2$-constrained PGD attack as well as the Carlini-Wagner attack.
The baseline accuracy of the undefended CNN7 is for adversarial examples from the $\ell_2$ PGD attack and for the Carlini-Wagner attack. Table 5 shows detection rates and corrected accuracies after our method is applied. As can be seen, there is only a slight decrease in performance, i.e. our method remains capable of detecting and correcting most adversarial examples of the previously unseen attacks.
Table 5. Unseen attacks: detection rate and accuracy (clean / attack).

Attack  | Detection rate | Accuracy
ℓ2 PGD  | 1.0% / 96.1%   | 93.3% / 92.9%
CW      | 4.8% / 91.6%   | 89.7% / 77.9%
5.7 Defending against Defense-Aware Attacks
Finally, we evaluate our method in a setting where the attacker is fully aware of the defense, in order to see whether the defended network is susceptible to cleverly designed counter-attacks. Since our defense is built on random sampling from noise sources that are under our control, the attacker will want to craft perturbations that perform well in expectation under this noise. The optimality of this strategy in the face of randomization-based defenses was established in Carlini & Wagner (2017a). Specifically, we compute the expected adversarial attack over a noise neighborhood around the input, with the same noise source as used for detection.
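The core step of such a defense-aware attack can be sketched as follows: instead of the plain loss gradient, the attacker uses a Monte-Carlo estimate of the gradient averaged over the detector's noise distribution (an instance of the expectation-over-randomness strategy; names and defaults are our own):

```python
import numpy as np

def expected_gradient(grad_fn, x, y, sigma=0.1, n_noise=32, rng=None):
    """Monte-Carlo estimate of E_eta[ grad L(x + eta, y) ] with
    eta ~ N(0, sigma^2 I), the same noise source the detector uses."""
    rng = rng or np.random.default_rng(0)
    grads = [grad_fn(x + sigma * rng.normal(size=x.shape), y)
             for _ in range(n_noise)]
    return np.mean(grads, axis=0)
```

This estimate would replace the plain gradient inside each PGD update, so that the resulting perturbation stays effective under the defender's noise in expectation.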
The undefended accuracies under this attack for the models under consideration are: CNN7 , WResNet and CNN4 . Table 6 shows the corresponding detection rates and accuracies after defending with our method. Compared to Section 5.6, the drop in performance is larger, as we would expect for a defense-aware counter-attack; however, both the detection rates and the accuracies remain remarkably high compared to the undefended network.
Table 6. Defense-aware attack: detection rate and accuracy (clean / attack).

Model   | Detection rate | Accuracy
CNN7    | 2.8% / 75.5%   | 91.2% / 56.6%
WResNet | 4.5% / 71.4%   | 91.7% / 56.0%
CNN4    | 4.1% / 81.3%   | 69.0% / 56.5%
6 Conclusion
We have shown that adversarial examples exist in cone-shaped regions in very specific directions from their corresponding natural samples. Based on this, we designed a statistical test of a given sample's log-odds robustness to noise that can infer with high accuracy whether the sample is natural or adversarial, and recover its original class label if necessary. Further research into the properties of network architectures is necessary to explain the underlying cause of this phenomenon. It remains an open question which current model families follow this paradigm and whether criteria exist that can certify that a given model is immunizable via this method.
Acknowledgements
We would like to thank Sebastian Nowozin, Aurelien Lucchi, Gary Becigneul, Jonas Kohler and the dalab team for insightful discussions and helpful comments.
References
 Athalye & Sutskever (2017) Athalye, A. and Sutskever, I. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
 Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
 Balestrino et al. (1984) Balestrino, A., De Maria, G., and Sciavicco, L. Robust control of robotic manipulators. IFAC Proceedings Volumes, 17(2):2435–2440, 1984.
 Biggio et al. (2013) Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Springer, 2013.
 Buss (2004) Buss, S. R. Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods. IEEE Journal of Robotics and Automation, 17(119):16, 2004.

 Carlini & Wagner (2017a) Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017a.
 Carlini & Wagner (2017b) Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017b.
 Fawzi et al. (2018) Fawzi, A., Fawzi, H., and Fawzi, O. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
 Feinman et al. (2017) Feinman, R., Curtin, R. R., Shintre, S., and Gardner, A. B. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
 Gilmer et al. (2018) Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., and Goodfellow, I. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Grosse et al. (2017) Grosse, K., Manoharan, P., Papernot, N., Backes, M., and McDaniel, P. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
 Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Kilcher & Hofmann (2017) Kilcher, Y. and Hofmann, T. The best defense is a good offense: Countering black box attacks by predicting slightly wrong labels. arXiv preprint arXiv:1711.05475, 2017.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Li & Li (2017) Li, X. and Li, F. Adversarial examples detection in deep networks with convolutional filter statistics. In ICCV, pp. 5775–5783, 2017.
 Lu et al. (2017) Lu, J., Issaranon, T., and Forsyth, D. A. Safetynet: Detecting and rejecting adversarial examples robustly. In ICCV, pp. 446–454, 2017.
 Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

 Marcel & Rodriguez (2010) Marcel, S. and Rodriguez, Y. Torchvision: the machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, pp. 1485–1488. ACM, 2010.
 Metzen et al. (2017) Metzen, J. H., Genewein, T., Fischer, V., and Bischoff, B. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.

 Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, S. M., Fawzi, A., and Frossard, P. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 Papernot et al. (2016) Papernot, N., McDaniel, P., and Goodfellow, I. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
 Sabour et al. (2015) Sabour, S., Cao, Y., Faghri, F., and Fleet, D. J. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
 Schmidt et al. (2018) Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
 Song et al. (2017) Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
 Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tramèr et al. (2017) Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
 Wolovich & Elliott (1984) Wolovich, W. A. and Elliott, H. A computational technique for inverse kinematics. In Decision and Control, 1984. The 23rd IEEE Conference on, volume 23, pp. 1359–1363. IEEE, 1984.
 Xie et al. (2017) Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
 Xu et al. (2017) Xu, W., Evans, D., and Qi, Y. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
7 Appendix
7.1 Experiments.
Further details regarding the implementation:
Details on the models used. All models on ImageNet are taken as pretrained versions from the torchvision Python package (https://github.com/pytorch/vision). For CIFAR10, both CNN7 (https://github.com/aaronxichen/pytorchplayground) and WResNet (https://github.com/MadryLab/cifar10_challenge) are available on GitHub as pretrained versions. The CNN4 model is a standard deep convolutional network with layers of 32, 32, 64 and 64 channels, each using convolutional filters and each layer being followed by a ReLU nonlinearity and MaxPooling, with a single fully connected softmax classifier at the end.
Training procedures. We used pretrained versions of all models except CNN4, which we trained for 50 epochs with RMSProp and a learning rate of 0.0001. For adversarial training, we trained for 50, 100 and 150 epochs using mixed batches of clean samples and their corresponding adversarial (PGD) samples, matching the respective training schedule and optimizer settings of the clean models, and then chose the best-performing classifier. The exception is the WResNet model, for which an adversarially trained version is already provided.
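The mixed-batch adversarial training described above can be sketched as follows. This is a minimal NumPy illustration using a linear softmax classifier as a stand-in for the actual networks; the PGD routine is the usual signed-gradient ascent projected back into an L-infinity ball, and all names, step sizes and epsilon values here are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_x(W, x, y):
    # gradient of the cross-entropy loss w.r.t. the input x (y is one-hot)
    return (softmax(x @ W) - y) @ W.T

def pgd(W, x, y, eps=0.3, step=0.1, iters=10):
    # L-infinity PGD: signed-gradient ascent, projected back into the eps-ball
    x_adv = x + rng.uniform(-eps, eps, x.shape)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_x(W, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def train_mixed(x, y, n_classes=2, epochs=50, lr=0.5):
    # each gradient step uses a mixed batch of clean and PGD samples
    W = np.zeros((x.shape[1], n_classes))
    y1 = np.eye(n_classes)[y]
    for _ in range(epochs):
        x_adv = pgd(W, x, y1)
        xb = np.concatenate([x, x_adv])
        yb = np.concatenate([y1, y1])
        W -= lr * xb.T @ (softmax(xb @ W) - yb) / len(xb)
    return W
```

The same loop applies to a deep network; only the gradient computations change.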
Setting the thresholds. The thresholds are set such that our statistical test achieves the highest possible detection rate (i.e. true positive rate) at a prespecified false positive rate (a different rate is used for Sections 5.6 and 5.7), computed on a hold-out set of natural and adversarially perturbed samples.
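This threshold-setting step can be sketched as follows, assuming we are given per-sample test statistics on a clean and an adversarial hold-out set; z_clean and z_adv are hypothetical arrays, and the 1% target rate is purely illustrative:

```python
import numpy as np

def fit_threshold(z_clean, target_fpr=0.01):
    # Flag a sample as adversarial when its statistic exceeds tau.
    # tau is the (1 - target_fpr) quantile of the clean hold-out statistics,
    # so at most target_fpr of clean samples are (wrongly) flagged.
    return np.quantile(z_clean, 1.0 - target_fpr)

def detection_rate(z, tau):
    return float(np.mean(z > tau))

rng = np.random.default_rng(0)
z_clean = rng.normal(0, 1, 10000)   # stand-in hold-out statistics
z_adv = rng.normal(4, 1, 10000)
tau = fit_threshold(z_clean)
fpr = detection_rate(z_clean, tau)  # close to 0.01 by construction
tpr = detection_rate(z_adv, tau)
```

Sweeping target_fpr traces out the ROC curves reported in Section 7.4.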
Determining attack strengths. For the adversarial attacks we consider, multiple parameters influence the strength of the attack. Usually, as attack strength increases, at some point there is a sharp increase in the fraction of samples in the dataset on which the attack is successful. We choose the lowest attack strength past this increase, i.e. the weakest setting at which the attack still succeeds on most of the datapoints. Note that weaker attacks generate adversarial samples that are closer to the original samples, which makes them harder to detect than those of excessively strong attacks.
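The selection rule amounts to a few lines of Python; the success fractions and the 90% cutoff below are made-up illustrations, not measured values:

```python
def pick_attack_strength(strengths, success_fracs, min_success=0.9):
    # Return the lowest strength whose success fraction is already past the
    # sharp increase, i.e. the weakest attack that still fools the model on
    # most datapoints (min_success is an illustrative cutoff).
    for s, frac in sorted(zip(strengths, success_fracs)):
        if frac >= min_success:
            return s
    return None

# hypothetical sweep of an eps parameter against measured success rates
eps = pick_attack_strength([0.01, 0.02, 0.05, 0.1, 0.2],
                           [0.05, 0.20, 0.95, 0.99, 1.0])  # -> 0.05
```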
Noise sources. Adding noise provides a non-atomic view, probing the classifier's output in an entire neighborhood around the input. In practice we sample noise from a mixture of different sources: Uniform, Bernoulli and Gaussian noise of different magnitudes, with the magnitudes sampled from a log-scale. For each noise source and magnitude, we draw 256 samples as the basis for noisy versions of the incoming datapoints, though we have not observed a large drop in performance when using only the single best combination of noise source and magnitude with fewer samples, which speeds up the wall time needed to classify a single sample by an order of magnitude. For detection, we test the sample in question against the distribution of each noise source and then take a majority vote as to whether the sample should be classified as adversarial.
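A sketch of the noise mixture and the majority vote, where per_source_test is a hypothetical stand-in for the per-source Z-score test against the stored statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_noise(source, magnitude, shape, n=256):
    # One batch of n noise samples per (source, magnitude) combination.
    if source == "gaussian":
        return rng.normal(0.0, magnitude, (n, *shape))
    if source == "uniform":
        return rng.uniform(-magnitude, magnitude, (n, *shape))
    if source == "bernoulli":
        return magnitude * rng.choice([-1.0, 1.0], (n, *shape))
    raise ValueError(source)

def majority_vote(x, per_source_test, sources, magnitudes):
    # per_source_test(x, noise) -> bool stands in for the Z-score test
    # against that source's stored statistics (hypothetical interface).
    votes = [per_source_test(x, draw_noise(s, m, x.shape))
             for s in sources for m in magnitudes]
    return sum(votes) > len(votes) / 2
```

Magnitudes on a log-scale can be generated with, e.g., np.logspace(-2, 0, 5).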
Plots. All measurements shown in plots with shaded areas have been repeated over the dataset; the line indicates the mean measurement and the shaded area one standard deviation around the mean.
Wall time performance. Since for each incoming sample at test time we have to forward-propagate a batch of noisy versions through the model, the time it takes to classify a sample in a robust manner using our method scales linearly in the number of noisy versions, compared to the same model undefended. The rest of our method has negligible overhead. At training time, we essentially have to do the same for the training dataset, which, depending on its size and the number of desired noise sources, can take a while. But for a given model and dataset this has to be computed only once, and the computed statistics can then be stored.
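The batched evaluation behind this cost model can be sketched as follows, with a stand-in linear model; the point is that one defended prediction is a single forward pass over N noisy copies of the input:

```python
import numpy as np

def robust_logits(model, x, noise_batch):
    # One defended prediction costs one forward pass over N noisy copies,
    # hence the N-fold wall-time factor relative to the undefended model.
    noisy = x[None, :] + noise_batch      # (N, d) batch of noisy copies
    return model(noisy)                   # (N, n_classes) logits

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
model = lambda batch: batch @ W           # stand-in for the network
x = rng.normal(size=4)
logits = robust_logits(model, x, rng.normal(0, 0.1, (256, 4)))
```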
7.2 Logistic classifier for reclassification.
Instead of selecting the class according to Eq. (6), we found that training a simple logistic classifier that gets as input all the Z-scores can further improve classification accuracy, especially in cases where several Z-scores are comparably far above the threshold. Specifically, for each predicted class label we train a separate logistic regression classifier; if a sample of that predicted class is detected as adversarial, the corresponding classifier yields the corrected class label. These classifiers are trained on the same training data that is used to collect the statistics for detection. Two points are worth noting. First, as the classifiers are trained using adversarial samples from a particular adversarial attack model, they might not be valid for adversarial samples from other attack models. However, we observe experimentally that our classifiers (trained using PGD) generalize well to other attacks. Second, building a classifier in order to protect a classifier might seem tautological, because this meta-classifier could now become the target of an adversarial attack itself. However, this does not apply in our case, as the inputs to our classifier are (i) low-dimensional (just a small number of weight-difference alignments for any given sample), (ii) a product of sampled noise and therefore random variables, and (iii) the classifier itself is shallow. All of these make it much harder to specifically attack this classifier. Further, in Section 5.7 we show that our method still performs reasonably well even if the adversary is aware of the defense.
7.3 Additional results mentioned in the main text.
[Figures: additional results for the PGD attack.]
7.4 ROC Curves.
Figure 11 shows how our method performs against a PGD attack under different settings of the detection thresholds.
[Figure 11: for each of CNN7, WResNet and CNN4, two panels are shown: accuracy on PGD samples vs. accuracy on clean samples, and true positive detection rate vs. false positive detection rate.]