With Friends Like These, Who Needs Adversaries?

07/11/2018 ∙ by Saumya Jetley, et al. ∙ University of Oxford

The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks (DCNs) that shed new light on their behaviour and its connection to the problem of adversaries, with two key results. The first is a straightforward explanation of the existence of universal adversarial perturbations and their association with specific class identities, obtained by analysing the properties of nets' logit responses as functions of 1D movements along specific image-space directions. The second is the clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Prior work has noted the importance of low-dimensional subspaces in adversarial vulnerability: we illustrate that this likewise represents the nets' notion of saliency. In all, we provide a digestible perspective from which to understand previously reported results which have appeared disjoint or contradictory, with implications for efforts to construct neural nets that are both accurate and robust to adversarial attack.






1 Introduction

Those studying deep networks find themselves forced to confront an apparent paradox. On the one hand, there is the demonstrated success of networks in learning class distinctions on training sets that seem to generalise well to unseen test data. On the other, there is the vulnerability of the very same networks to adversarial perturbations that produce dramatic changes in class predictions despite being counter-intuitive or even imperceptible to humans. A common understanding of the issue can be stated as follows: “While deep networks have proven their ability to distinguish between their target classes so as to generalise over unseen natural variations, they curiously possess an Achilles heel which must be defended.” In fact, efforts to formulate attacks and counteracting defences of networks have led to a dedicated competition nips2017competition and a body of literature already too vast to summarise in total.

In the current work we attempt to demystify this phenomenon at a fundamental level. We base our work on the geometric decision boundary analysis of moosavi-dezfooli2018robustness , which we reinterpret and extend into a framework that we believe is simpler and more illuminating with regard to the aforementioned paradoxical behaviour of deep convolutional networks (DCNs) for image classification. Through a fairly straightforward set of experiments and explanations, we clarify what it is that adversarial examples represent, and indeed, what it is that modern DCNs do and do not currently do. In doing so, we tie together work which has focused on adversaries per se with other work which has sought to characterise the feature spaces learned by these networks.

Let $i \in \mathbb{R}^n$ represent vectorised input images and $\bar{i}$ be the average vector-image over a given dataset. Then, the mean-normalised version of the dataset is denoted by $X = \{x_1, x_2, \ldots, x_N\}$, where the $j$-th image is $x_j = i_j - \bar{i}$. We define the perturbation of the image $x_j$ in the direction $d$ as $\tilde{x}_j = x_j + s\hat{d}$, where $s$ is the perturbation scaling factor and $\hat{d}$ is the unit-norm vector in the direction $d$. The image $\tilde{x}_j$ is fed through a network parameterised by $\theta$, and the output score (the output of the layer just before the softmax operation, commonly known as the logit layer) for a specific class $c$ is given by $F_c(\tilde{x}_j \mid \theta)$. This class-score function can be rewritten as $F_c(x_j + s\hat{d} \mid \theta)$, which we equivalently denote by $F_c(s)$. Our work examines the nature of $F_c(s)$ as a function of movement in specific image-space directions $\hat{d}$ starting from randomly sampled natural images $x_j$, for a variety of classification DCNs. With this novel analysis, we uncover three noteworthy observations about these functions that relate directly to the phenomenon of adversarial vulnerability in these nets, all of which are on display in Fig. 1. We now discuss these observations in more detail.
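The definitions above can be made concrete with a minimal NumPy sketch. It is an illustration only: `toy_logits` is a hypothetical stand-in for a trained network's logit layer, and the names `perturb` and `class_score_along` are not from the paper.

```python
import numpy as np

def perturb(x, d, s):
    """Return x + s * d_hat, where d_hat is d rescaled to unit norm."""
    d_hat = d / np.linalg.norm(d)
    return x + s * d_hat

def class_score_along(logits_fn, x, d, c, scales):
    """Evaluate the class-c logit at x + s * d_hat for every s in scales."""
    return np.array([logits_fn(perturb(x, d, s))[c] for s in scales])

# Hypothetical stand-in for a trained net: logits that are quadratic in the input.
rng = np.random.default_rng(0)
n, num_classes = 8, 3
W = rng.normal(size=(num_classes, n))

def toy_logits(x):
    return (W @ x) ** 2

x = rng.normal(size=n)   # plays the role of a mean-normalised image
d = rng.normal(size=n)   # an arbitrary image-space direction
scales = np.linspace(-5.0, 5.0, 11)
scores = class_score_along(toy_logits, x, d, c=0, scales=scales)
```

Plotting `scores` against `scales` for many sampled images is the kind of 1D class-score curve shown in Fig. 1, here for a toy function rather than a real DCN.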

Figure 1: Plots of the ‘frog’ class score for the Network-in-Network lin2013network architecture trained on CIFAR10, associated with two specific image-space directions, $d_1$ and $d_2$ respectively. These directions are visualised as 2D images in the row below; the method of estimating them is explained in Sec. 3. Each plot corresponds to a randomly selected CIFAR10 test image. Adding or subtracting components along $d_1$ causes the network to change its prediction to ‘frog’: as can be seen, a ‘deer’ with a mild diamond striping added to it gets classified as a ‘frog’. This happens with little regard for the choice of input image itself. Likewise, perturbations along $d_2$ change any ‘frog’ to a ‘non-frog’ class: notice the predicted labels for the sample images along the red curve in the second plot. These class-transition phenomena are predicted by the framework developed in this paper. While simplistic functions along directions $d_1$ and $d_2$ are used by the network to accomplish the task of classification, perturbations along the very same directions constitute adversarial attacks.

Before we begin, we note that these directions are obtained via the method explained in Sec. 3 and by design exhibit either positive or negative association with a specific class. In Fig. 1 we study two such directions for the ‘frog’ class: similar directions exist for all other classes. First, notice that the score $F_c(s)$ of the corresponding class $c$ (‘frog’, in this case) as a function of the scaling factor $s$ is often approximately symmetrical about some point $s_0$, i.e. $F_c(s_0 + s) \approx F_c(s_0 - s)$, and monotonic in both half-lines. This means that simply increasing the magnitude of correlation between the input image and a single direction causes the net to believe that more (or less) of the class is present. In other words, the image-space direction $\hat{d}$ sends all images either towards or away from the class $c$. In the former scenario, the direction represents a class-specific universal adversarial perturbation (UAP). Second, let $s = x \cdot \hat{d}$, and let $x_\perp$ be the projection of $x$ onto the space normal to $\hat{d}$, such that $x = s\hat{d} + x_\perp$. Then, our results illustrate that there exists a basis of image space containing $\hat{d}$ such that the class-score function is approximately additively separable, i.e. $F_c(x) \approx G(s) + H(x_\perp)$ for some functions $G$ and $H$. This means that the directions under study can be used to alter the nets’ predictions almost independently of each other. However, despite these facts, their 2D visualisation reveals low-level structures that are devoid of a clear semantic link to the associated classes, as shown in Fig. 1. Thus, we demonstrate that the learned functions encode a more simplistic notion of class identity than DCNs are commonly assumed to represent, albeit one that generalises to the test distribution to an extent. Unsurprisingly, this does not align with the way in which the human visual system makes use of these data dimensions: ‘adversarial vulnerability’ is simply the name given to this disparity and the phenomena derived from it, with the universal adversarial perturbations of moosavi2017universal being a particularly direct example of this fact.
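To illustrate what symmetry and additive separability mean operationally, the following sketch builds a hypothetical score that is separable by construction and checks that a step along the direction changes the score by the same amount whatever the orthogonal component is. The functions `G`, `H` and `toy_score` are invented for illustration and say nothing about the functions a real net learns.

```python
import numpy as np

def decompose(x, d):
    """Split x into its component s along unit(d) and the remainder x_perp."""
    d_hat = d / np.linalg.norm(d)
    s = float(x @ d_hat)
    x_perp = x - s * d_hat
    return s, x_perp, d_hat

# Hypothetical score, additively separable by construction: F(x) = G(s) + H(x_perp).
def G(s):
    # Symmetric about s0 = 0.5 and monotonic on either side of it.
    return (s - 0.5) ** 2

def H(x_perp):
    return float(np.sum(np.tanh(x_perp)))

def toy_score(x, d):
    s, x_perp, _ = decompose(x, d)
    return G(s) + H(x_perp)

rng = np.random.default_rng(1)
d = rng.normal(size=6)
x_a, x_b = rng.normal(size=6), rng.normal(size=6)
delta = 1.5

def score_change(x):
    """Observed change in score after a step of size delta along d_hat,
    next to the change predicted from G alone."""
    s, _, d_hat = decompose(x, d)
    observed = toy_score(x + delta * d_hat, d) - toy_score(x, d)
    return observed, G(s + delta) - G(s)
```

For both `x_a` and `x_b`, the two values returned by `score_change` agree: the step along `d_hat` leaves `x_perp` untouched, so only `G` contributes to the change.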

Finally, we show that nets’ classification performance and adversarial vulnerability are inextricably linked by the way they make use of the above directions, on a variety of architectures. Consequently, efforts to improve robustness by “suppressing” nets’ responses to components in these directions (e.g. gao2017deepcloak ) cannot simultaneously retain full classification accuracy. The features and functions thereof that DCNs currently rely on to solve the classification problem are, in a sense, their own worst adversaries.

2 Related Work

2.1 Fundamental developments in attack methods

Szegedy et al. coined the term ‘adversarial example’ in szegedy2014intriguing , demonstrating the use of box-constrained L-BFGS to estimate a minimal $\ell_2$-norm additive perturbation to an input image to cause its label to change to a target class while keeping the resulting image within intensity bounds. Strikingly, they locate a small-norm (imperceptible) perturbation at every point, for every network tested. Further, the adversaries thus generated are able to fool nets trained differently to one another, even when trained with different subsets of the data. Goodfellow et al. goodfellow2015explaining subsequently proposed the ‘fast gradient sign method’ (FGSM) to demonstrate the effectiveness of the local linearity assumption in producing the same result, calculating the gradient of the cost function and perturbing with a fixed-size step in the direction of its sign (optimal under the linearity assumption and an $\ell_\infty$-norm constraint). The DeepFool method of Moosavi-Dezfooli et al. moosavi2016deepfool retains the first-order framework of FGSM, but tailors itself precisely to the goal of finding the perturbation of minimum norm that changes the class label of a given natural image to any label other than its own. Through iterative attempts to cross the nearest (linear) decision boundary by a tiny margin, this method records successful perturbations with norms that are even smaller than those of goodfellow2015explaining . In moosavi2017universal , Moosavi-Dezfooli & Fawzi et al. propose an iterative aggregation of DeepFool perturbations that produces “universal” adversarial perturbations: single images which function as adversaries over a large fraction of an entire dataset for a targeted net. While these perturbations are typically much larger than individual DeepFools, they do not correspond to human perception, and indicate that there are fixed image-space directions along which nets are vulnerable to deception independently of the image-space locations at which they are applied. They also demonstrate some generalisation over network architectures.
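As a rough illustration of the fast gradient sign method mentioned above, here is a minimal NumPy sketch on a hypothetical linear softmax classifier, chosen because its input gradient has a closed form. A real attack would obtain the gradient from the network by automatic differentiation; the weights, input and step size below are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, b, x, label):
    return -np.log(softmax(W @ x + b)[label])

def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x,
    for the linear model logits = W x + b."""
    p = softmax(W @ x + b)
    p[label] -= 1.0              # softmax probabilities minus the one-hot label
    return W.T @ p

def fgsm(W, b, x, label, eps):
    """Single fixed-size step in the direction of the sign of the gradient."""
    return x + eps * np.sign(input_gradient(W, b, x, label))

rng = np.random.default_rng(2)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x, label, eps = rng.normal(size=5), 1, 0.1
x_adv = fgsm(W, b, x, label, eps)
loss_clean = cross_entropy(W, b, x, label)
loss_adv = cross_entropy(W, b, x_adv, label)
```

Every coordinate of `x_adv` differs from `x` by `eps` in magnitude, so the perturbation sits on the boundary of the max-norm ball, and `loss_adv` exceeds `loss_clean` for this convex toy model.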

Sabour & Cao et al. sabour2016manipulation pose an interesting variant of the problem: instead of “label adversaries”, they target “feature adversaries” which minimise the distance from a particular guide image in a selected network feature space, subject to a constraint on the $\ell_\infty$-norm of image-space distance from a source image. Despite this constraint, the adversarial image mimics the guide very closely: not only is it nearly always assigned to the guide’s class, but it appears to be an inlier with respect to the guide-class distribution in the chosen feature space. Finally, while “adversaries” are conceived of as small perturbations applied to natural images such that the resulting images are still recognisable to humans, the “fooling images” of Nguyen et al. nguyen2015deep are completely unrecognisable to humans and yet confidently predicted by deep networks to be of particular classes. Such images are easily obtained by both evolutionary algorithms and gradient ascent, under direct encoding of pixel intensities (appearing to consist mostly of noise) and under CPPN stanley2007compositional -regularised encoding (appearing as abstract mid-level patterns).

2.2 Analysis of adversarial vulnerability and proposed defences

In wang2016theoretical , Wang et al. propose a nomenclature and theoretical framework with which to discuss the problem of adversarial vulnerability in the abstract, agnostic of any actual net or attack thereof. They denote an oracle relative to whose judgement robustness and accuracy must be assessed, and illustrate that a classifier can only be both accurate and robust (invulnerable to attack) relative to its oracle if it learns to use exactly the same feature space that the oracle does. Otherwise, a network is vulnerable to adversarial attack in precisely the directions in which its feature space departs from that of the oracle. Under the assumption that a net’s feature space contains some spurious directions, Gao et al. gao2017deepcloak propose a subtractive scheme of suppressing the neuronal activations (i.e. feature responses) which change significantly between the natural and adversarial inputs. Notably, the increase in robustness is accompanied by a loss of performance accuracy. An alternative to network feature suppression is the compression of input image data explored in e.g. maharaj2015improving ; das_jpegcompression ; xie_randomisation .

Goodfellow et al. goodfellow2015explaining hypothesise that the high dimensionality and excessive linearity of deep networks explain their vulnerability. Tanay and Griffin tanay2016boundary begin by taking issue with the above via illustrative toy problems. They then advance an explanation based on the angle of intersection of the separating boundary with the data manifold which rests on overfitting and calls for effective regularisation, which they note is neither solved nor known to be solvable for deep nets. A variety of training-based goodfellow2015explaining ; madry2017towards ; lu2017safetynet ; metzen2017detecting methods are proposed to address the premise of the preceding analyses. Hardening methods goodfellow2015explaining ; madry2017towards investigate the use of adversarial examples to train more robust deep networks. Detection-based methods lu2017safetynet ; metzen2017detecting view adversaries as outliers to the training data distribution and train detectors to identify them as such in the intermediate feature spaces of nets. Notably, these methods lu2017safetynet ; metzen2017detecting have not been evaluated on the feature adversaries of Sabour & Cao et al. sabour2016manipulation . Further, data augmentation schemes such as that of Zhang et al. zhang2018mixup , wherein convex combinations of input images are mapped to convex combinations of their labels, attempt to enable the nets to learn smoother decision boundaries. While their approach zhang2018mixup offers improved resistance to single-step gradient sign attacks, it is no more robust to iterative attacks of the same type.

Over the course of the line of work in moosavi-dezfooli2018robustness , fawzi2016semirandom , fawzi2017classification , and fawzi2017robustness , the authors build up an image-space analysis of the geometry of deep networks’ decision boundaries, and its connection with adversarial vulnerability. In fawzi2017classification , they note that the DeepFool perturbations of moosavi2016deepfool tend to evince relatively high components in the subspace spanned by the directions in which the decision boundary has a high curvature. Also, the sign of the mean curvature of the decision boundary in the vicinity of a DeepFooled image is typically reversed with respect to that of the corresponding natural image, which provides a simple scheme to identify and undo the attack. They conclude that a majority of image-space directions correspond to near-flatness of the decision boundary and are insensitive to attack, but along the remaining directions, those of significant curvature, the network is indeed vulnerable. Further, the directions in question are observed to be shared over sample images. They illustrate in moosavi-dezfooli2018robustness  why a hypothetical network which possessed this property would theoretically be predicted to be vulnerable to universal adversaries, and note that the analysis suggests a direct construction method for such adversaries as an alternative to the original randomised iterative approach of moosavi2017universal : they can be constructed as random vectors in the subspace of shared high-curvature dimensions.

3 Method

The analysis begins as in moosavi-dezfooli2018robustness , with the extraction of the principal directions and principal curvatures of the classifier’s image-space class decision boundaries. Put simply, a principal direction vector and its associated principal curvature tell you how much a surface curves as you move along it in a particular direction, from a particular point. Now, it takes many decision boundaries to characterise the classification behaviour of a multiclass net: $\binom{C}{2}$ for a $C$-class classifier. However, in order to understand the boundary properties that are useful for discriminating a given class from all others, it should suffice to analyse only the 1-vs.-all decision boundaries. Thus, for each class $t$, the method proceeds by locating samples very near to the decision boundary between $t$ and the union of remaining classes. In practice, for each sample, this corresponds to the decision boundary between $t$ and the closest neighbouring class $c$, which is arrived at by perturbing the sample from the latter (“source”) to the former (“target”). Then, the geometry of the decision boundary is estimated as outlined in Alg. 1 below, closely following the approach of moosavi-dezfooli2018robustness . (Footnote 3: For more discussion about the implementation and associated concepts, refer to Appendix Sec. A.2.)

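Algorithm 1: Computes mean principal directions and principal curvatures for a net’s image-space decision surface.
Input: network class score function $F$, dataset $X$, target class label $t$
Output: principal curvature basis matrix $V$ and corresponding principal curvature-score vector $\lambda$
procedure PrincipalCurvatures($F$, $X$, $t$)
    $\bar{H} \leftarrow$ null
    for each sample $x \in X$ s.t. $\arg\max_k F_k(x) \neq t$ do
        $c \leftarrow \arg\max_k F_k(x)$    (network predicts $x$ to be of class $c$)
        $H_{ct}$: define as Hessian of function $F_c - F_t$    (subscripts select class scores)
        $\tilde{x} \leftarrow$ DeepFool($x$, $t$)    (approximate nearest boundary point to $x$)
        $\bar{H} \leftarrow \bar{H} + H_{ct}(\tilde{x})$    (accumulate Hessian at sample boundary point)
    $\bar{H} \leftarrow \bar{H} / N$    (normalise mean Hessian by number of samples $N$)
    $(V, \lambda)$ = Eigs($\bar{H}$)    (compute eigenvectors and eigenvalues of mean Hessian)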
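The averaging and eigendecomposition steps of Alg. 1 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the boundary points are taken as given (in the paper they come from DeepFool), the Hessian is estimated by central finite differences rather than by automatic differentiation, and `score_diff` is a hypothetical quadratic standing in for the difference between source and target class scores.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference estimate of the Hessian of scalar f at x."""
    n = x.size
    eye = np.eye(n)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            hess[i, j] = (f(x + h * eye[i] + h * eye[j])
                          - f(x + h * eye[i] - h * eye[j])
                          - f(x - h * eye[i] + h * eye[j])
                          + f(x - h * eye[i] - h * eye[j])) / (4.0 * h * h)
    return 0.5 * (hess + hess.T)     # symmetrise against rounding noise

def principal_curvatures(score_diff, boundary_points):
    """Average the Hessian over the boundary points, then eigendecompose.
    Returns eigenvectors as columns, sorted from most positive to most negative."""
    mean_hessian = np.mean(
        [numerical_hessian(score_diff, p) for p in boundary_points], axis=0)
    eigvals, eigvecs = np.linalg.eigh(mean_hessian)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]

# Hypothetical score difference with one positive, one flat and one
# negative curvature direction.
A = np.diag([3.0, 0.0, -2.0])
b_vec = np.array([0.0, 1.0, 0.0])

def score_diff(x):
    return 0.5 * x @ A @ x + b_vec @ x - 1.0

# Points chosen to lie exactly on the zero level set of score_diff.
boundary_points = [
    np.array([x0, 1.0 - 0.5 * (3.0 * x0 ** 2 - 2.0 * x2 ** 2), x2])
    for x0, x2 in [(0.2, -0.1), (-0.5, 0.3), (1.0, 0.7)]
]
directions, curvatures = principal_curvatures(score_diff, boundary_points)
```

For this toy function the recovered `curvatures` are approximately 3, 0 and -2, and the first column of `directions` is aligned with the first coordinate axis, i.e. the most positively curved direction. The finite-difference Hessian costs a number of function evaluations quadratic in the input dimension, so it is only practical for small toy inputs.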


The authors of moosavi-dezfooli2018robustness advance a hypothesis connecting positively curved directions with the universal adversarial perturbations of moosavi2017universal . Essentially, they demonstrate that if the normal section of a net’s decision surface along a given direction can be locally bounded on the outside by a circular arc of a particular positive curvature in the vicinity of a sample image point, then geometry accordingly dictates an upper bound on the distance between that point and the boundary in that direction. If such directions and bounds turn out to be largely common across sample image points (which they do), then the existence of universal adversaries follows directly, with higher curvature implying lower-norm adversaries. This argument is depicted visually in Appendix Sec. A.1. It is from this point that we move beyond the prior art and begin an iterative loop of further analysis, experimentation, and demonstration, as follows.

4 Experiments and Analysis

Provided only that the second-order boundary approximation holds up well over a sufficiently wide perturbation range and variety of images, the model implies that the distance of such adversaries from the decision boundary should increase as a function of their norm. Also, the attack along any positively curved direction should in that case be associated with the corresponding target class: the class $t$ in the call to Alg. 1. And while positively curved directions may be of primary interest in moosavi-dezfooli2018robustness , the extension of the above geometric argument to the negative-curvature case points to an important corollary: as sufficient steps along positive-curvature directions should perturb increasingly into class $t$, so should steps along negative-curvature directions perturb increasingly away from class $t$. Finally, the plethora of approximately zero-curvature (flat) directions identified in fawzi2017classification ; moosavi-dezfooli2018robustness should have negligible effect on class identity.

Figure 2: Selected class scores plotted as functions of the scaling factor of the perturbation along the most positively curved direction per net. The ‘Median class score’ plot compares the score of a randomly selected target class with the supremum of the scores for the non-target classes. Each curve represents the median of the class scores over the associated dataset, bracketed below by the 30th-percentile score and above by the 70th. The ‘Transition into target class’ plot depicts the fraction of the dataset not originally of the target class, but which is transitioned into the target class by the perturbation. Alongside, we graph that population’s median softmax target-class score. The black dashed line represents the fraction of the population originally of the target class that remains in the target class under the perturbation. The image grid on the right illustrates the 2D visualisations of the two most positively curved directions for randomly selected target classes: the columns correspond, from left to right, with the four net-dataset pairs under study. To observe class scores as functions of the norms of the perturbations along the most negatively curved and flat directions, refer to Appendix Sec. A.3.

4.1 Class identity as a function of the component in specific image-space directions

To test how well the above conjectures hold in practice, we graph statistics of the target and non-target class scores over the dataset as a function of the magnitude of the perturbation applied in directions identified as above. The results are depicted in Fig. 2, in which the predicted phenomena are readily evident. Along the selected positive-curvature directions, as the perturbation magnitude increases (with either sign), the population’s target class score approaches and then surpasses the highest non-target class score. The monotonicity of this effect is laid bare by graphing the fraction of non-target samples perturbed into the target class, alongside the median target class softmax score. Note, again, that the link between the directions in question and the target class identity is established a priori by Alg. 1. We continue in the Appendix and show that, as predicted, the same phenomenon is evident in reverse when using negative-curvature directions instead: see Fig. 7. All that changes is that it is the population’s non-target class scores that overtake its target class score with increasing perturbation norm, with natural samples of the target class accordingly being perturbed out of it. We also illustrate the point that flatness of the decision boundary manifests as flatness of both target and non-target class scores: over a wide range of magnitudes, these directions do not influence the network in any way. While Fig. 2 illustrates these effects at the level of the population, Fig. 1 shows a disaggregation into individual sample images, with one response curve per sample from a large set. The population-level trends remain evident, but another fact becomes apparent: empirically, the shapes of the curves change very little between most samples. They shift vertically to reflect the class score contribution of the orthonormal components, but they themselves do not otherwise much depend on those components. 
That is to say that at least some key components are approximately additively separable from one another. This fact connects directly to the fact that such directions are “shared” across samples in the first place, and thus identifiable by Alg. 1.

A more intuitive picture of what the networks are actually doing begins to emerge: they are identifying the high-curvature image-space directions as features associated with respective class identities, with the curvature magnitude representing the sensitivity of class identity to the presence of that feature. But if this is true, it suggests that what we have thus identified are actually the directions which the net relies on generally in predicting the classes of natural images, with the curvatures-cum-sensitivities representing their relative weightings. Accordingly, it should be possible to disregard the “flat” directions of near-zero curvature without any noticeable change in the network’s class predictions.

4.2 Network classification performance versus effective data dimensionality

To confirm the above hypothesis regarding the relative importance of different image-space directions for classification, we plot the training and test accuracies of a sample of nets as a function of the subspace onto which their input images are projected. The input subspace is parametrised by a dimensionality parameter $k$, which controls the number of basis vectors selected per class. We use four variants of selection: the $k$ most positively curved directions per class (yielding the subspace $S_{pos}$); the $k$ most negatively curved directions per class (yielding the subspace $S_{neg}$); the union of the previous two (subspace $S_{pos} \cup S_{neg}$); and the $k$ least curved (flattest) directions per class (subspace $S_{flat}$). The subspace so obtained is represented by the orthonormalised basis matrix $Q$ (obtained by QR decomposition of the aggregated directions), and each input image $x$ is then projected onto the subspace as $x_p = Q Q^\top x$. (Footnote 4: The mean training-set orthogonal component can be added, but is approximately zero in practice for data normalised by mean subtraction, as is the case here.) Accuracies on the projected images as a function of $k$ are shown in the top row of Fig. 3.
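The projection step described above can be sketched in a few lines of NumPy. The random `directions` matrix is only a placeholder for the aggregated per-class principal directions, and the sketch assumes its columns are linearly independent so that the reduced QR factor spans exactly their span.

```python
import numpy as np

def orthonormal_basis(directions):
    """Orthonormalise the columns of an (n, m) matrix of directions via QR."""
    Q, _ = np.linalg.qr(directions)   # reduced mode: Q has shape (n, m)
    return Q

def project_onto_subspace(Q, x):
    """Orthogonal projection of x onto the column span of Q."""
    return Q @ (Q.T @ x)

rng = np.random.default_rng(3)
n, m = 10, 4                          # input dimension, number of directions kept
directions = rng.normal(size=(n, m))  # placeholder for aggregated directions
Q = orthonormal_basis(directions)
x = rng.normal(size=n)                # a mean-normalised input
x_p = project_onto_subspace(Q, x)
```

The residual `x - x_p` is orthogonal to every retained direction, and projecting `x_p` again leaves it unchanged. Feeding `x_p` rather than `x` to the classifier is what the accuracy curves in the top row of Fig. 3 measure.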

Figure 3: Top row: Training and test classification accuracies for various DCNs on image sets projected onto the subspaces described in Sec. 4.2, as a function of their dimensionality parameter $k$ (until the input space is fully spanned). The principal directions defining the subspaces are obtained by applying Alg. 1 once for each possible choice of target class and retaining $k$ directions per class. Note the relationship between the ordering of curvature magnitudes and classification accuracy by comparing the curves for the flattest directions to the others. Bottom row: Mean $\ell_2$-norms of various adversarial perturbations (DeepFool moosavi2016deepfool , FGSM goodfellow2015explaining and UAP moosavi2017universal ) and saliency maps simonyan2013visualising when projected onto the same subspaces as above, as a fraction of their original norms.
Figure 4: Classification accuracies on image sets projected onto subspaces of the spans of their corresponding DeepFool perturbations. For each net-dataset pair, DeepFool perturbations are computed over the image set and assembled into a matrix that is decomposed into its SVD. The singular vectors are ordered as per their singular values, either high-to-low or low-to-high, and $k$ denotes the number of vectors retained. Compare this figure to Fig. 3 (while noticing how $k$ now counts the total number of directions). For the ImageNet experiments, owing to memory constraints, the SVD is performed on downsampled DeepFools. The resulting singular vectors span the entire effective classification space of correspondingly downsampled images. This is evinced by the fact that the classification accuracy of images projected onto the singular vectors’ subspace saturates to the same performance as that yielded when the net is tested directly on the downsampled images.
Figure 5: Blue curves depict the mean $\ell_2$-norms of “confined DeepFool” perturbations: those that are calculated under strict confinement to the respective subspaces of Fig. 4, also detailed in Sec. 4.3. Note the differences in scale of the $y$-axes of the different plots. For MNIST and CIFAR, we also plot (in red) the mean norms of the projections of the input images onto those subspaces: observe the inverse relationship between the two curves. The columns on the right visualise, from top to bottom, sample images at the indicated points on the curves in the CIFAR100-AlexNet plots, from left to right: blue-bordered images represent confined DeepFool perturbations (rescaled for display), with their red-bordered counterparts displaying the projection of the corresponding sample CIFAR image onto the same subspace. Observe that when the human-recognisable object appearance is captured in any given subspace, the corresponding DeepFool perturbation becomes maximally effective (i.e. small-norm). Likewise, when the projected image is not readily recognisable to a human, the DeepFool perturbation is large. The feature space per se does not account for adversariality: the issue is in the net’s response to the features.

The outcome is striking: it is evident that in many cases, classification decisions have effectively already been made based on a relatively small number of features, corresponding to the most curved directions. The sensitivity of the nets along these directions, then, is clearly learned purposefully from the training data, and does largely generalise in testing, as seen. Note also that at this level of analysis, it essentially does not matter whether positively or negatively curved directions are chosen. Another important point emerges here. Since it is the high-curvature directions that are largely responsible for determining the nets’ classification decisions, the nets should be vulnerable to adversarial attack along precisely these directions.

4.3 Link between classification and adversarial directions

It has already been noted in fawzi2017classification that adversarial attack vectors evince high components in subspaces spanned by high-curvature directions. We expand the analysis by repeating the procedure of Sec. 4.2 for various attack methods, to determine whether existing attacks are indeed exploiting the directions in accordance with the classifier’s reliance on them. Results are displayed in the bottom row of Fig. 3, and should be compared against the row above. The graphs in these figures illustrate the direct relationship between the fraction of adversarial norm in given subspaces and the corresponding usefulness of those subspaces for classification. The inclusion of the saliency images of simonyan2013visualising alongside the attack methods makes explicit the fact that adversaries are themselves an exposure of the net’s notion of saliency.

By now, two results hint at a simpler and more direct way of identifying bases of classification/adversarial directions. First, a close inspection of the class-score curves sampled and displayed in Fig. 1 reveals a direct connection between the curvature of a direction near the origin and its derivative magnitude over a fairly large interval around it. Second, this observation is made clearer in Fig. 3, where it can be seen that the directions obtained by boundary curvature analysis in Alg. 1 correspond to the directions exploited by various first-order methods. Thus, we hypothesise that to identify such a basis, one need actually only perform SVD on a matrix of stacked class-score gradients. (Footnote 5: In fact, this analysis is begun in moosavi2017universal , but only the singular values are examined.) Here, we implement this using a collection of DeepFool perturbations to provide the required gradient information, and repeat the analysis of Sec. 4.2, using singular values to order the vectors. The results, in Fig. 4, neatly replicate the previously seen classification accuracy trends for high-to-low and low-to-high curvature traversal of image-space directions. Henceforth, we use these directions directly, simplifying analysis and allowing us to analyse ImageNet networks.
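A minimal sketch of the SVD step, assuming the perturbations (or gradients) have already been computed and stacked as rows of a matrix. The synthetic `vectors` below are a stand-in: they are generated to lie close to a planted two-dimensional subspace, so that the recovery can be checked.

```python
import numpy as np

def direction_basis(vectors):
    """SVD of an (N, n) matrix of stacked perturbations.
    Returns the right singular vectors as columns, ordered high-to-low
    by singular value, together with the singular values."""
    _, singular_values, vt = np.linalg.svd(vectors, full_matrices=False)
    return vt.T, singular_values

# Synthetic stand-in for a collection of DeepFool perturbations.
rng = np.random.default_rng(4)
n, num_samples = 6, 50
planted, _ = np.linalg.qr(rng.normal(size=(n, 2)))     # orthonormal, shape (n, 2)
coeffs = rng.normal(size=(num_samples, 2))
vectors = coeffs @ planted.T + 1e-3 * rng.normal(size=(num_samples, n))

basis, singular_values = direction_basis(vectors)
top2 = basis[:, :2]
recovered = top2 @ (top2.T @ planted)   # planted directions projected onto top-2 span
```

The two leading singular vectors recover the planted subspace and the remaining singular values are far smaller. Reversing the column order of `basis` gives the low-to-high ordering used for comparison in Fig. 4.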

While Fig. 3 displays the magnitudes of components of pre-computed adversarial perturbations in different subspaces, we also design a variation on the analysis to illustrate how effective an efficient attack method (DeepFool) is when confined to the respective subspaces. This is implemented by simply projecting the gradient vectors used in solving DeepFool’s linearised problem onto each subspace before otherwise solving the problem as usual. The results, displayed in Fig. 5, thus represent DeepFool’s “earnest” attempts to attack the network as efficiently as possible within each given subspace. It is evident that the attack must exploit genuine classification directions in order to achieve low norm.
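The effect of confining the attack can be seen in a single linearised step. The sketch below uses a hypothetical affine binary score `w @ x + b`, for which one step reaches the boundary exactly; DeepFool proper iterates such steps on a multiclass net. Projecting the gradient onto the subspace before solving is the only change between the free and confined versions.

```python
import numpy as np

def linear_boundary_step(w, b, x, Q=None):
    """Minimum-norm step taking x onto the boundary w @ x + b = 0.
    If an orthonormal basis Q is given, the gradient w is first projected
    onto span(Q), which confines the step to that subspace."""
    f = w @ x + b
    g = w if Q is None else Q @ (Q.T @ w)
    return -f * g / (g @ g)

rng = np.random.default_rng(5)
n = 8
w, b = rng.normal(size=n), 0.3                  # hypothetical affine classifier
x = rng.normal(size=n)
Q, _ = np.linalg.qr(rng.normal(size=(n, 3)))    # arbitrary 3-dimensional subspace

r_free = linear_boundary_step(w, b, x)
r_conf = linear_boundary_step(w, b, x, Q)
```

Both steps land on the boundary, but the confined step has norm `|f| / ||Q Q^T w||`, which is never smaller than the unconfined `|f| / ||w||` and grows as the subspace captures less of the gradient. This mirrors the observation that low-norm attacks need the directions the classifier actually relies on.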

Rows: one per image size. Fooling-rate columns: one per perturbation scaling factor, increasing from left to right.

Mean image norm   Mean perturbation norm   Accuracy (%)   Fooling rate (%)
26798.72          63.96                    57.75          100.00  100.00  100.00  100.00  100.00  100.00
26515.20          53.19                    55.80           32.75   77.25   88.95   92.20   94.35   97.65
26327.03          46.86                    53.50           35.55   58.35   77.90   85.95   89.25   95.65
26159.98          41.92                    51.75           36.15   49.80   66.20   76.90   82.95   92.90
26008.02          37.98                    48.10           41.65   49.25   59.95   68.05   74.80   88.30

Table 1: The images used to train AlexNet operate at a fixed scale (pixels on a side). In the pre-processing step, these images are downsized to a smaller size, before being upsampled back to the original scale. The reconstructed DeepFool perturbations lose some of their effectiveness, as seen in the first fooling-rate column. When the effect of downsampling is countered by increasing the value of the $\ell_2$-norms of these perturbations (using higher values of the scaling factor), their efficacy is steadily restored. Note that the mean norms of images and perturbations are estimated in the upscaled space, as are the classification accuracies. The accuracy values should be compared to those at convergence in Fig. 4. Any difference in the performance scores is strictly due to the random selection of the subset of test images used for evaluation.

4.4 On image compression and robustness to adversarial attack

The above observations have made it clear that the most effective directions of adversarial attack are also the directions that contribute the most to the DCNs’ classification performance. Hence, any attempt to mitigate adversarial vulnerability by discarding these directions, either by compressing the input data maharaj2015improving ; das_jpegcompression ; xie_randomisation or by suppressing specific components of image representations at intermediate network layers gao2017deepcloak , must effect a loss in the classification accuracy. Further, our framework anticipates the fact that the nets must remain just as vulnerable to attack along the remaining directions that continue to determine classification decisions, given that the corresponding class-score functions, which possess the properties discussed earlier, remain unchanged. We use image downsampling as an example data compression technique to illustrate this effect on ImageNet.

We proceed by inserting a pre-processing unit between the DCN and its input at test time. This unit downsamples the input image to a lower size before upsampling it back to the original input size. The resizing (by bicubic interpolation) serves to reduce the effective dimensionality of the input data. For a randomly selected set of ImageNet ILSVRC15 test images, we observe the change in classification accuracy over different downsampled sizes, shown in the accuracy column of Table 1. The fooling rates (measured as the percentage of samples from the dataset that undergo a change in their predicted label) for the downsampled versions of these natural images’ adversarial counterparts, produced by applying DeepFool to the original network (without the resampling unit), follow in the first fooling-rate column of the table. At first glance, it appears that the downsampling-based pre-processing unit has afforded an increase in the network robustness at a moderate cost in accuracy. Results pertaining to this tradeoff have been widely reported maharaj2015improving ; das_jpegcompression ; gao2017deepcloak . Here, we take the analysis a step further.
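The resampling unit and the fooling-rate measure can be sketched in a few lines of NumPy. Note the assumptions: block averaging and pixel repetition are used here as simple stand-ins for the bicubic interpolation used in the paper, and both function names are hypothetical.

```python
import numpy as np

def resample(image, factor):
    """Downsample by block averaging, then upsample by pixel repetition.

    Stand-in for bicubic down/upsampling (an assumption for illustration).
    """
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0
    small = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def fooling_rate(clean_labels, perturbed_labels):
    """Percentage of samples whose predicted label changes."""
    clean = np.asarray(clean_labels)
    perturbed = np.asarray(perturbed_labels)
    return 100.0 * np.mean(clean != perturbed)

image = np.arange(16.0).reshape(4, 4)
out = resample(image, 2)

# Labels below are made up purely to exercise the function.
rate = fooling_rate([0, 1, 2, 3], [0, 1, 5, 7])
```

With this particular stand-in, applying `resample` twice gives the same result as applying it once, so the unit acts as a projection onto a lower-dimensional set of images. That is the sense in which resizing reduces the effective dimensionality of the input.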

To start, we note that the methodology just described represents a transfer attack from the original net to the net as modified by the inclusion of the resampling unit. As DeepFool perturbations are not designed to transfer in this manner, we first augment them by simply increasing their norm by a scalar factor. We adjust this factor from unity up to a point at which the mean DeepFool perturbation norm is still a couple of orders of magnitude smaller than the mean image norm, such that the perturbations remain largely imperceptible. The corresponding fooling rates grow steadily with the scaling factor, as is observable in Table 1. Hence, although the original full-resolution perturbations may be suboptimal attacks on the resampling variants of the network (as some components are effectively lost to projection onto the compressed space), sufficient rescaling restores their effectiveness. On the other hand, the modified net continues to be equally vulnerable along the remaining effective classification directions, and can easily be attacked directly. To go about this, we simply take the SVD of the stack of downsampled DeepFool perturbations, for selected values of the downsampled size (owing to memory constraints). The resulting singular vectors span the entire space of classification/adversarial directions of the corresponding resampling network, as can be seen from the accuracy values in the rightmost subplot of Fig. 4. More crucially, lower-norm DeepFools can be obtained by restricting the attack’s iterative linear optimisation procedure to the space spanned by these compressed perturbations, exactly as described in Sec. 4.3 and displayed in Fig. 5. This subspace-confined optimisation is analogous to designing a white-box DeepFool attack for the new network architecture inclusive of the resampling unit, instead of the original network, and is as effective as before.
Note that this observation is consistent with the results reported in xie_randomisation , where the strength of the examined gradient-based attack methods increases progressively as the targeted model better approximates the defending model.
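The two effects discussed above (rescaling restores a transferred perturbation, while a direct attack on the modified net is cheaper) can be reproduced on a toy linear classifier. Everything in this sketch is an assumption made for illustration: the classifier, the coordinate-zeroing "compression" `keep`, the scale factors, and the helper names.

```python
import numpy as np

w = np.ones(4)
b = -1.0
keep = np.array([1.0, 1.0, 0.0, 0.0])  # hypothetical compression: zero the last two coordinates

def score(x):
    """Original net: positive score means the original class is retained."""
    return w @ x + b

def score_compressed(x):
    """Modified net: the same score applied after the compression step."""
    return w @ (keep * x) + b

def minimal_step(value, gradient, overshoot=0.02):
    """Smallest step along `gradient` that crosses the linear boundary."""
    return -(1.0 + overshoot) * value * gradient / np.dot(gradient, gradient)

x = np.array([1.0, 1.0, 0.0, 0.0])

# Attack computed on the original net, then transferred with scale factor alpha.
r = minimal_step(score(x), w)
transfer = {alpha: bool(score_compressed(x + alpha * r) < 0) for alpha in (1.0, 1.5, 2.0)}

# Attack computed directly on the modified net (gradient confined to kept coordinates).
r_direct = minimal_step(score_compressed(x), keep * w)
```

Here the unscaled perturbation fools the original net but not the compressed one, because half of it is discarded by the compression. Doubling its norm restores the attack, yet the direct perturbation succeeds with a smaller norm than the rescaled one.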

5 Conclusion

In this work, we expose a collection of directions along which a given net’s class-score output functions exhibit striking similarity across sample images. These functions are nonlinear, but are de facto of a relatively constrained form: roughly axis-symmetric (though not necessarily so for MNIST, because of its constraints: see Appendix Sec. A.4) and typically monotonic over large ranges. We illustrate a close relationship between these directions and class identity: many such directions effectively encode the extent to which the net believes that a particular target class is or is not present. Thus, as it stands, the predictive power and adversarial vulnerability of the studied nets are intertwined owing to the fact that they base their classification decisions on rather simplistic responses to components of the input images in specific directions, irrespective of whether the source of those components is natural or adversarial. Clearly, any gain in robustness obtained by suppressing the net’s response to these components must come at the cost of a corresponding loss of accuracy. We demonstrate this experimentally. We also note that these robustness gains may be lower than they appear, as the network actually remains vulnerable to a properly designed attack along the remaining directions it continues to use. A discussion including some nuanced observations and connections to existing work that follow from our study can be found in Appendix Sec. A.4. To conclude, we believe that for any scheme to be truly effective against the problem of adversarial vulnerability, it must lead to a fundamentally more insightful (and likely complicated) use of features than presently occurs. Until then, those features will continue to be the nets’ own worst adversaries.


This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the Royal Academy of Engineering, FiveAI, and extend our thanks to Seyed-Mohsen Moosavi-Dezfooli for providing his research code for curvature analysis of decision boundaries of DCNs.


Appendix A

A.1 Illustration of fundamental geometric argument

Figure 6: The left image illustrates the geometric argument of moosavi-dezfooli2018robustness connecting positive curvature to universal perturbations: from the perspective of a source image, if the actual class boundary (solid black) can be locally bounded on the outside by an arc (dashed black) of fixed positive curvature along a particular direction, then a perturbation along that direction large enough to cross the bounding arc must carry the image across the actual boundary, changing its class. If such a direction is “shared” over many samples, it will accordingly represent a universal perturbation over that set. The right image illustrates that this effect is not particular to positive curvatures: in the negative-curvature case, the same holds true for an imagined point on the other side of the boundary. The central image shows that a “flat” direction cannot have a material effect on membership of either class.

A.2 Details and clarifications of use of differential geometric concepts

For a more involved treatment of the basic design of the decision boundary curvature analysis algorithm, consult the references moosavi-dezfooli2018robustness and fawzi2017classification from which it is derived. If still in need of further clarification of fundamental concepts of differential geometry, consult a book such as docarmo1976differential . We now address some specifics not discussed elsewhere, including in our own main text.

For one, it may seem strange to speak of the “curvature” of a decision boundary of a ReLU network, as analytically, this is everywhere either zero or undefined. “Curvature” is here (and in the relevant citations) computed via finite-difference numerical methods, implicitly corresponding to a smooth approximation of the actual piecewise linear boundary.

The astute reader may likewise notice that the use of the terms ‘principal direction’ and ‘principal curvature’ here (as originally in moosavi-dezfooli2018robustness ) is somewhat relaxed. Let us clarify this point. Strictly speaking, the principal directions at a point on a manifold in an embedding space (such as a class decision boundary embedded in the higher-dimensional image space) are a local concept, forming an orthonormal basis spanning the space tangent to the manifold at that point. The principal curvature associated with each principal direction is the curvature, at the point of tangency, of the normal section of the manifold in the principal direction.

Generally, there is no reason to think that tangent spaces at different points on the boundary surface should coincide, and so a priori, it may not make any sense to speak of “principal directions” in the embedding space. However, the authors of moosavi-dezfooli2018robustness are fully aware of this, and base their analysis on the hypothesis that there exist image-space directions which, when projected onto the respective tangent spaces of different points sampled from the boundary, correspond to similar curvature patterns. In other words, they assume that these directions are largely shared across sample images. A curvature can then be associated with each such direction by, for instance, taking the mean of the curvatures measured at those sample points. This is the relaxation of the terminology referred to above: the “principal directions” here represent a rotation of the canonical image-space axes, and the “principal curvature” associated with each direction represents the sample mean curvature measured along the tangent component of that vector at each sample point in a set.

In practice, the above can in fact be simplified, as is done explicitly in the version of the algorithm given in the main text as Alg. 1. Rather than managing the expense and complexity of the tangent-space projections, it suits our purposes here to work directly with the Hessian of the embedding function, effectively omitting the projection step.
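For a problem small enough to hold the Hessians in memory, the simplified procedure can be sketched as an eigendecomposition of the sample-mean Hessian. This is an illustrative sketch only: the function name and the three hand-written per-sample Hessians are assumptions, and real image-space Hessians are far too large to treat this way.

```python
import numpy as np

def shared_curvature_directions(hessians):
    """Eigendecompose the sample-mean Hessian, sorted by decreasing curvature."""
    mean_hessian = np.mean(np.asarray(hessians, dtype=float), axis=0)
    mean_hessian = 0.5 * (mean_hessian + mean_hessian.T)  # enforce exact symmetry
    curvatures, vectors = np.linalg.eigh(mean_hessian)    # eigh returns ascending order
    order = np.argsort(curvatures)[::-1]
    return curvatures[order], vectors[:, order].T         # directions as rows

# Made-up per-sample Hessians that share one positive, one flat, one negative direction.
hessians = [
    np.diag([2.0, 0.0, -1.0]),
    np.diag([3.0, 0.1, -2.0]),
    np.diag([1.0, -0.1, -3.0]),
]
curvatures, directions = shared_curvature_directions(hessians)
```

The returned directions form a rotation of the canonical axes, and each associated eigenvalue is the mean curvature along that direction over the sample set, matching the relaxed usage of "principal direction" and "principal curvature" described above.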

Finally, we note that in contrast to the simplification given in Alg. 1, the (very large) sample mean Hessians are never actually computed and stored explicitly. Instead, in each case, a function is constructed that approximates the product of the mean Hessian with a given vector via finite differences of backpropagated gradients. The MATLAB function eigs uses this function to compute the eigendecomposition of the approximated mean Hessian.
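A matrix-free computation of this kind can be sketched as follows. The sketch makes several assumptions: a central finite difference of gradients stands in for the authors' approximation, plain power iteration stands in for MATLAB's eigs (which uses a different solver), and the quadratic test function and all names are hypothetical.

```python
import numpy as np

def hessian_vector_product(grad_fn, x, v, eps=1e-4):
    """Approximate H(x) @ v by a central finite difference of gradients."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

def top_eigenpair(matvec, dim, iters=200, seed=0):
    """Power iteration: needs only matrix-vector products, never the matrix.

    Converges to the eigenvalue of largest magnitude.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = matvec(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ matvec(v)), v

# Toy function f(x) = 0.5 * x^T A x, whose Hessian is A everywhere.
A = np.diag([4.0, 1.0, 0.5])
grad_fn = lambda x: A @ x  # in practice this would be a backpropagated gradient
x0 = np.zeros(3)

eigenvalue, eigenvector = top_eigenpair(
    lambda v: hessian_vector_product(grad_fn, x0, v), dim=3
)
```

Each product costs two gradient evaluations, so memory use scales with the input dimension rather than its square. Averaging the products over sample points would give the action of the sample-mean Hessian without ever forming it.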

A.3 Class identity as a function of perturbation scale: negative-curvature and flat directions

Figure 7: Selected class scores plotted as functions of the scaling factor of the perturbation along the most negatively curved directions and flat directions per net. The ‘Median class score’ plot compares the score of a randomly selected target class with the supremum of the scores for the non-target classes, for the negatively curved directions. For the flat curvature directions, it plots the score of a randomly selected target class and non-target class respectively. Each curve represents the median of the class scores over the associated dataset, bracketed below by the 30th-percentile score and above by the 70th. For the negative-curvature directions, the ‘Transition out of target class’ graph works in reverse to the corresponding positive-curvature graph in Fig. 2 of the main paper: ‘sample proportion’ represents the fraction of the dataset originally of the target class which retains the target-class label under perturbation, with the median softmax target-class score as before. The ‘black dashed line’ now represents the fraction of the dataset not originally of the target class which remains outside of the target class under perturbation. The images in the rightmost column illustrate a sample of these directions as visual patterns. Each block of eight images corresponds to the label (negative, or flat) to its left, and the two-image columns in each block correspond from left to right with the main four net-dataset pairs under study.

A.4 Additional discussion, observations, and relationships with existing work

First, regarding the main result(s) of the paper as summarised in the conclusion (Sec. 5), we would like to point out that the discovery of the universal adversarial perturbations of moosavi2017universal strongly hinted at this outcome in advance. Those attacks are “universal” in precisely the sense that certain fixed directions perturb different images across class boundaries irrespective of the diversity in their individual appearances, i.e., irrespective of the input components in all directions orthogonal to the perturbation. In fact, it is precisely because the functions are as separable as they are in this sense that the method used is able to identify them as well as it does. Note also that our results explain the “dominant label” phenomenon of moosavi2017universal , noted therein as curious but otherwise left unaddressed, in which a given universal perturbation overwhelmingly moves examples into a small number of target classes regardless of their original class identities. This is a manifestation of the targeting of particular classes by particular directions, a phenomenon so strong that it manifests in moosavi2017universal despite the fact that their algorithm never explicitly optimises for this property.

In the context of discussing the adversarial vulnerability of DCNs, we would advise caution in using the terms ‘overfitting’ and ‘generalisation’, particularly if done speculatively. Inspection of Figs. 3 (top row) and 4 will convince the reader that the directions of vulnerability produce fits that generalise very well to unseen data generated by the same distribution, observable as the near identity between training and test accuracy curves over the directions of highest curvature (or derivative) magnitude. This does not correspond to the classical notion of overfitting in machine learning. In fact, overfitting, observable as the divergence between the train and test error curves, happens over less salient (and thus, less vulnerable) directions. This can again be confirmed by inspection. The fact that the nets extract characterisations of their targets that do not correspond to that of humans is a separate issue.

Finally, the approximate symmetry of the feature response functions may appear counter-intuitive: why should the additive inverse of a given perturbation pattern produce such a similar result to the original pattern? We suggest that at least part of the explanation may rest in a fact long known in the hand-engineering of features for computer vision: natural image descriptors must typically neglect sign, because contrast inversion is a fact of the world. For instance, consider how a black bird and a white bird, which share the label bird, differ from one another when both are set against the background of a blue sky. One can revisit, for instance, dalal2005histograms for a reminder. Empirically, this nonlinearity appears to be particularly important. Note that DCNs trained on MNIST are exceptional in that they may not adhere to this, as MNIST is unnatural in its fixing of contrast sign: e.g. the net need never learn that a black vertical stroke against a white background also represents the digit ‘1’.