# A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples

## 1 Introduction

Tremendous progress has been made in the field of Deep Learning in recent years. Convolutional Neural Networks in particular started to deliver promising results in 2012 on the ImageNet Large Scale Visual Recognition Challenge (krizhevsky2012imagenet). Since then, improvements have come at a very high pace: the range of applications has widened (xu2015show; mnih2015human), network architectures have become deeper and more complex (szegedy2015going; simonyan2014very), training methods have improved (he2015deep), and other important tricks have helped increase classification performance and reduce training time (srivastava2014dropout; ioffe2015batch). As a consequence, deep networks that are able to outperform humans are now being produced: for instance on the challenging ImageNet dataset (he2015delving), or on face recognition (schroff2015facenet).

Yet the same networks present a surprising weakness: their classifications are extremely sensitive to some small, non-random perturbations (szegedy2013intriguing). As a result, any correctly classified image possesses adversarial examples: perturbed images that appear identical (or nearly identical) to the original image according to human observers, and hence should belong to the same class, but that are classified differently by the networks (see figure 1).

There seems to be a fundamental contradiction in the existence of adversarial examples in state-of-the-art neural networks. On the one hand, these classifiers learn powerful representations on their inputs, resulting in high performance classification. On the other hand, every image of each class is only a small perturbation away from an image of a different class. Stated differently, the classes defined in image space seem to be both well-separated and intersecting everywhere. In the following, we refer to this apparent contradiction as the adversarial examples paradox.

In section 2, we present two existing answers to this paradox, including the currently accepted linear explanation of goodfellow2014explaining. In section 3, we argue that the linear explanation presents a number of limitations: the formal argument is unconvincing; we can define classes of images on which linear models do not suffer from the phenomenon; and the adversarial examples affecting logistic regression on the 3s vs 7s MNIST problem appear qualitatively very different from the ones affecting GoogLeNet on ImageNet. In section 4, we introduce the boundary tilting perspective. We start by presenting a new pictorial solution to the adversarial examples paradox: a submanifold of sampled data, intersected by a class boundary that lies close to it, suffers from adversarial examples. Then we develop a mathematical analysis of the new perspective in the linear case. We define a strict condition for the non-existence of adversarial examples, from which we deduce a measure of strength for the adversarial examples affecting a class of images. Then we show that the adversarial strength can be reduced to a simple parameter: the deviation angle between the weight vector of the classifier considered and the weight vector of the nearest centroid classifier. We also show that the adversarial strength can become arbitrarily high without affecting performance when the classification boundary tilts along a component of low variance in the data. This result leads us to define a new taxonomy of adversarial examples. Finally, we show experimentally using SVM that the adversarial strength observed in practice is controlled by the level of regularisation used. With very high regularisation, the phenomenon of adversarial examples is minimised and the classifier defined converges towards the nearest centroid classifier. With very low regularisation however, the training data is overfitted by boundary tilting, leading to the existence of strong adversarial examples.

## 2 Previous Explanations

### 2.1 Low-probability “pockets” in the manifold

In (szegedy2013intriguing), the existence of adversarial examples was regarded as an intriguing phenomenon. No detailed explanation was proposed, and only a simple analogy was introduced:

“Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found virtually near every test case” [emphasis added]

Using the mathematical concept of density, and the example of the rational numbers in particular, we can indeed define a classifier that suffers from the phenomenon of adversarial examples. Consider the classifier $C$ operating on the real numbers with the following decision rule for a test number $x$:

• $x$ belongs to the class $P$ of positive numbers if it is positive irrational or negative rational.

• $x$ belongs to the class $N$ of negative numbers if it is negative irrational or positive rational.

On a test set selected at random among real numbers, $C$ discriminates perfectly between positive and negative numbers: the real numbers contain infinitely more irrational numbers than rational numbers, so whatever test number $x$ we choose at random among the real numbers, $x$ is almost surely irrational, and thus correctly classified. Yet $C$ suffers from the phenomenon of adversarial examples: since the set of rational numbers is dense in the set of real numbers, $x$ is infinitely close to rational numbers that constitute adversarial examples.

The rational numbers analogy is interesting, but it leaves one important question open: why would deep networks define decision rules that are in any way as strange as the one defined by our example classifier? By what mechanism should the low-probability “pockets” be created? Without attempting to provide a detailed answer, szegedy2013intriguing suggested that it was made possible by the high non-linearity of deep networks.

### 2.2 Linear explanation

goodfellow2014explaining subsequently provided a more detailed analysis of the phenomenon, and introduced the linear explanation — currently generally accepted. Their explanation relies on a new analogy:

“We can think of this as a sort of ‘accidental steganography’, where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.” [emphasis added]

Given an input $x$ and an adversarial example $\tilde{x} = x + \eta$ where $\eta$ is subject to the constraint $\|\eta\|_\infty < \epsilon$, the argument is the following:

“Consider the dot product between a weight vector $w$ and an adversarial example $\tilde{x}$:

$$w^\top \tilde{x} = w^\top x + w^\top \eta$$

The adversarial perturbation causes the activation to grow by $w^\top \eta$. We can maximise this increase subject to the max norm constraint on $\eta$ by assigning $\eta = \mathrm{sign}(w)$. If $w$ has $n$ dimensions and the average magnitude of an element of the weight vector is $m$, then the activation will grow by $\epsilon m n$. Since $\|\eta\|_\infty$ does not grow with the dimensionality of the problem but the change in activation caused by the perturbation by $\eta$ can grow linearly with $n$, then for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output.”

The authors concluded that “a simple linear model can have adversarial examples if its input has sufficient dimensionality”. This argument was followed with the observation that small linear movements in the direction of the sign of the gradient (with respect to the input image) can cause deep networks to change their predictions, and hence that “linear behaviour in high-dimensional spaces is sufficient to cause adversarial examples”.
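The quoted argument can be checked numerically. The following is only a minimal sketch under assumed values: the weight vectors are drawn from a standard normal distribution, `eps = 0.01` is an arbitrary max-norm budget, and the helper name `activation_growth` is hypothetical. It uses the scaled perturbation `eps * sign(w)`, for which the change in activation is exactly `eps * m * n`.

```python
import numpy as np

def activation_growth(w, eps):
    """Change in activation w.eta for the max-norm perturbation eta = eps * sign(w)."""
    eta = eps * np.sign(w)
    return float(w @ eta)

rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
eps = 0.01                       # assumed max-norm budget
for n in (100, 1_000, 10_000):
    w = rng.standard_normal(n)   # hypothetical weight vector
    m = np.abs(w).mean()         # average magnitude of a weight
    growth = activation_growth(w, eps)
    # The two printed values coincide: the growth is eps * m * n.
    print(n, growth, eps * m * n)
```

Note that this only reproduces the absolute growth of the activation change; it says nothing about how that change compares with the activation itself.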

Technical remarks:

• What norm should be used to evaluate the magnitude of a small perturbation? The image perturbations used to generate adversarial examples are typically measured with a norm that does not necessarily match perceptual magnitude. For instance, goodfellow2014explaining use the infinity norm, based on the idea that digital measuring devices are insensitive to small perturbations whose infinity norm is below a certain threshold (because of digital quantization). This is a reasonable but arbitrary choice. Other norms (such as the 1-norm or the 2-norm) might be considered better adapted, because for human observers the magnitude of a perturbation depends not only on the maximum change along individual pixels but also on the number of changing pixels. (A perturbation applied to the single pixel in the top left corner of an image does not have the same perceptual magnitude as a perturbation of the same amplitude applied across the entire image, yet the infinity norm gives the same magnitude to the two perturbations.) This is a fairly technical point of little importance in practice, except for determining the specific direction in which to move when looking for adversarial examples. We use the 2-norm, so that the direction we move in is simply the direction of the gradient. In other words, we create adversarial examples by adding to the input image a step along the gradient direction, instead of a step along the sign of the gradient, as one does for the infinity norm.

• In previous works, the phenomenon of adversarial examples in linear classification was investigated using logistic regression (szegedy2013intriguing; goodfellow2014explaining). In the present study, we use another standard linear classifier: support vector machine (SVM) with linear kernel. The two methods are largely equivalent but we prefer SVM for its geometrical interpretation, more adapted to the boundary tilting perspective we introduce in the following.
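The difference between the two choices of norm can be sketched in a few lines. This is an illustration only: the function names `l2_step` and `linf_step` and the example gradient are hypothetical, and `eps` stands for the chosen perturbation magnitude.

```python
import numpy as np

def l2_step(grad, eps):
    """Perturbation of 2-norm eps pointing along the gradient direction."""
    return eps * grad / np.linalg.norm(grad)

def linf_step(grad, eps):
    """Perturbation of infinity-norm eps pointing along the sign of the gradient."""
    return eps * np.sign(grad)

grad = np.array([3.0, -4.0, 0.0])   # hypothetical gradient with respect to the input image
print(l2_step(grad, 1.0))           # proportional to the gradient itself
print(linf_step(grad, 1.0))         # every non-zero component moved by the same amount
```

With the 2-norm, pixels with a larger gradient receive a larger share of the perturbation; with the infinity norm, all pixels with a non-zero gradient are changed by the same amount.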

## 3 Limitations with the Linear Explanation

### 3.1 An unconvincing argument

The idea of accidental steganography is a seductive intuition that seems to illustrate well the phenomenon of adversarial examples. Yet the argument is unconvincing: small perturbations do not provoke changes in activation that grow linearly with the dimensionality of the problem, when they are considered relative to the activations themselves. Consider again the dot product between a weight vector $w$ and an adversarial example $\tilde{x}$: $w^\top \tilde{x} = w^\top x + w^\top \eta$. As we have seen before, the change in activation $w^\top \eta$ grows linearly with the dimensionality of the problem; but so does the activation $w^\top x$ (provided that the weight and pixel distributions in $w$ and $x$ stay unchanged), and the ratio between the two quantities stays constant.

We illustrate this by performing linear classification on a modified version of the 3s vs 7s MNIST problem in which the image size has been increased. We generated the new dimensions by linear interpolation and increased variability by adding some noise to the original and the modified datasets (random perturbations on every pixel). The results for the two image sizes look strikingly similar (see figure 2). Importantly, increasing the image resolution has no influence on the perceptual magnitude of the adversarial perturbations, even though the dimension of the problem has been multiplied by more than 50.

In sum, the dimensionality argument does not hold: high dimensional problems are not necessarily more prone to the phenomenon of adversarial examples. Without this central result however, can we still maintain that linear behaviour is sufficient to cause adversarial examples?
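The scaling argument of this section can be sketched on synthetic vectors. The example below is an illustration under stated assumptions, not the MNIST experiment itself: the weights and pixels are hypothetical random values, and the resolution is increased by simple repetition of each entry (a crude stand-in for the linear interpolation used above) so that the weight and pixel distributions are unchanged.

```python
import numpy as np

def relative_change(w, x, eps):
    """Ratio between the activation change w.eta and the activation w.x, for eta = eps * sign(w)."""
    return float(eps * np.abs(w).sum() / (w @ x))

rng = np.random.default_rng(0)            # assumed seed
w = rng.uniform(0.1, 1.0, size=784)       # hypothetical positive weights
x = rng.uniform(0.0, 1.0, size=784)       # hypothetical pixel intensities
for k in (1, 10, 50):
    # Repeat every weight and pixel k times: the dimension is multiplied by k.
    wk, xk = np.repeat(w, k), np.repeat(x, k)
    print(wk.size, relative_change(wk, xk, eps=0.1))
```

Both the activation and its change are multiplied by `k`, so the printed ratio does not depend on the dimension.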

### 3.2 Linear behaviour is not sufficient to cause adversarial examples

According to the linear explanation of goodfellow2014explaining, linear behaviour itself is responsible for the existence of adversarial examples. If we take this explanation literally, then we expect all linear classification problems to suffer from the phenomenon. Yet we can find classes of images for which adversarial examples do not exist at all. Consider the following toy problem (figure 3).

Let $I$ and $J$ be two classes of images defined as follows:

• Class $I$: left half-image noisy (random pixel values) and right half-image black (pixel value: 0).

• Class $J$: left half-image noisy (random pixel values) and right half-image white (pixel value: 1).

If we train a linear SVM on images of each class, we achieve perfect separation of the training data with full generalisation to novel test data. When we look at the weight vector defined by SVM, we notice that it correctly represents the feature separating the two classes: it ignores the left half-image (all weights near zero) and takes into consideration the entire right half-image (all weights near 1). As a result, adversarial examples do not exist. Indeed, if we take an image in one of the two classes and move in the gradient direction until we reach the class boundary, then we get an image that is also perceived as being between the two classes according to human observers (grey right half-image); and if we continue to move in the gradient direction until we reach a confidence level that the new image belongs to the new class equal to the confidence level that the original image belonged to the original class, then we get an image that is also perceived as belonging to the new class according to human observers.

This toy problem is very artificial and the point we make from it might not seem very convincing for the moment, but it should not be disputed that there is a priori nothing in the current linear explanation that allows us to predict which classes of images will suffer from the phenomenon of adversarial examples, and which will not. In the following section we consider a more realistic problem: MNIST. We will return to the toy problem in section 4.3.

### 3.3 Linear classification on MNIST. Are these examples really adversarial?

A key argument in favour of the linear explanation of adversarial examples was that logistic regression also suffers from the phenomenon. In contrast, we argue here that what happens with linear classifiers on MNIST is very different from what happens with deep networks on ImageNet.

The first difference between the two situations is very clear: the adversarial perturbations have a much higher magnitude and are very perceptible by human observers in the case of linear classifiers on MNIST (see figure 1). Importantly, the image resolution cannot account for this difference: increasing the size of the MNIST images does not influence the perceptual magnitude of the adversarial perturbations (as shown in section 3.1). Not only does the linear explanation unreliably predict whether the phenomenon of adversarial examples will occur on a specific dataset (as shown in section 3.2), it also cannot predict the magnitude of the adversarial perturbations necessary to make the classifier change its predictions when the phenomenon does occur.

Another important difference between the adversarial examples shown in (goodfellow2014explaining) for GoogLeNet on ImageNet and the ones shown for logistic regression on MNIST concerns the appearance of the adversarial perturbations. With GoogLeNet on ImageNet, the perturbation is dominated by high-frequency structure which cannot be meaningfully interpreted; with logistic regression on MNIST, the perturbation is low-frequency dominated and although goodfellow2014explaining argue that it is “not readily recognizable to a human observer as having anything to do with the relationship between 3s and 7s”, we believe that it can be meaningfully interpreted: the weight vector found by logistic regression points in a direction that is close to passing through the mean images of the two classes, thus defining a decision boundary similar to the one of a nearest centroid classifier (see figure 4).

Simple linear models defined by SVM or logistic regression can be deceived on MNIST by perturbations that are visually perceptible and that look roughly like the weight vector of the nearest centroid classifier. This result is hardly surprising and does not help explain why much more sophisticated models — such as deep networks — can be deceived by imperceptible perturbations which look to human observers like random noise. Clearly, the linear explanation is still incomplete.

## 4 The Boundary Tilting Perspective

In previous sections, we rejected the linear explanation of goodfellow2014explaining: high dimension is insufficient to explain the phenomenon of adversarial examples and linear models seem to suffer from a weaker type of adversarial examples than deep networks. Without the linear explanation however, the adversarial examples paradox persists: how can two classes of images be well separated, if every element of each class is close to an element of the other class?

In figure 4(a), we present a schematic representation of the solution proposed in (szegedy2013intriguing): the classes $I$ and $J$ are well separated, but every element of each class is very close to an element of the other class because low probability adversarial pockets are densely distributed in image space. In figure 4(b), we introduce a new solution. First, we observe that the data sampled in the training and test sets only extends in a submanifold of the image space. A class boundary can intersect this submanifold such that the two classes are well separated, but will also extend beyond it. Under certain circumstances, the boundary might be lying very close to the data, such that small perturbations directed towards the boundary might cross it.

Note that in the low dimensional representation of figure 4(b), randomly perturbed images are likely to cross the class boundary. In higher dimension however, the probability that a random perturbation moves exactly in the direction of the boundary is low, such that images that are close to it (and thus sensitive to adversarial perturbations), are robust to random perturbations, in accordance with the results in (szegedy2013intriguing).

### 4.2 Adversarial examples in linear classification

The drawing of figure 4(b) is, of course, a severe oversimplification of the reality — but it is a useful one. As we noticed already, it is a low dimensional impression of a phenomenon happening in much higher dimension. It also misrepresents the complexity of real data distributions and the highly non-linear nature of the class boundary defined by a state-of-the-art classifier. Yet it is useful because it allows us to make important predictions. First, the drawing is compatible with a flat class boundary and no non-linearity is required (contrary to the view relying on the presence of low probability pockets). Hence the phenomenon of adversarial examples should be observable in linear classification. At the same time, linear behaviour is not sufficient for the phenomenon to occur either: the class boundary needs to “be tilted” and lie close to the data. In the following, we propose a mathematical analysis of this boundary tilting explanation in linear classification. We start by giving a strict condition for the non-existence of adversarial examples, from which we deduce a measure of strength for the adversarial examples affecting a class of images. We also show that the adversarial strength can be reduced to a simple parameter: the deviation angle between the classifier considered and the nearest centroid classifier. Then, we introduce the boundary tilting mechanism and show that it can lead to adversarial examples of arbitrary strength without affecting classification performance. Finally, we propose a new taxonomy of adversarial examples.

#### 4.2.1 Condition for the non-existence of adversarial examples

In the standard procedure, adversarial examples are found by moving along the gradient direction by a magnitude $\epsilon$ chosen such that 99% of the data is misclassified (goodfellow2014explaining). The smaller $\epsilon$ is, the more “impressive” the resulting adversarial examples. This approach is meaningful when $\epsilon$ is very small, but as $\epsilon$ grows, when should one stop considering the images obtained as adversarial examples? When they start to actually look like images of the other class? Or when the adversarial perturbation starts to be perceptible to the human eye? Here, we introduce a strict condition for the non-existence of adversarial examples.

Let $I$ and $J$ be two classes of images, and $C$ a hyperplane boundary defining a linear classifier in $\mathbb{R}^n$. $C$ is formally specified by a normal weight vector $c$ (we assume that $\|c\|_2 = 1$) and a bias $c_0$. For any image $x$ in $\mathbb{R}^n$, we define:

• The classification score of $x$ through $C$ as: $d(x,C) = x \cdot c + c_0$.
$d(x,C)$ is the signed distance between $x$ and $C$.
$x$ is classified in $I$ or in $J$ according to the sign of $d(x,C)$.

• The projected image of $x$ on $C$ as: $p(x,C) = x - d(x,C)\,c$.
$p(x,C)$ is the nearest image $y$ lying on $C$ (i.e. such that $d(y,C) = 0$).

• The mirror image of $x$ through $C$ as: $m(x,C) = x - 2\,d(x,C)\,c$.
$m(x,C)$ is the nearest image $y$ with opposed classification score (i.e. such that $d(y,C) = -d(x,C)$).

• The mirror class of $I$ through $C$ as: $m(I,C) = \{\, m(x,C) \mid x \in I \,\}$.
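These three operators are simple to implement. The sketch below assumes the score takes the form $d(x,C) = x \cdot c + c_0$ with a unit-norm weight vector; the function names and the numerical values are hypothetical, and each function handles a single image given as a 1-D array.

```python
import numpy as np

def score(x, c, c0):
    """Classification score d(x, C) = x.c + c0 (a signed distance when ||c|| = 1)."""
    return float(x @ c + c0)

def project(x, c, c0):
    """Projected image p(x, C): the nearest point lying on the boundary."""
    return x - score(x, c, c0) * c

def mirror(x, c, c0):
    """Mirror image m(x, C): the nearest point with opposite classification score."""
    return x - 2.0 * score(x, c, c0) * c

c = np.array([0.6, 0.8])     # hypothetical unit normal vector
c0 = -1.0                    # hypothetical bias
x = np.array([2.0, 3.0])     # hypothetical image with two pixels

print(score(x, c, c0))                      # score of x
print(score(project(x, c, c0), c, c0))      # numerically zero: the projection lies on the boundary
print(score(mirror(x, c, c0), c, c0))       # opposite of the score of x
```

Applying `mirror` twice returns the original image, which is the involution property used in the argument that follows.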

Suppose that $C$ does not suffer from adversarial examples. Then for every image $x$ in $I$, the projected image $p(x,C)$ must lie exactly between the classes $I$ and $J$. Since $p(x,C)$ is the midpoint between $x$ and the mirror image $m(x,C)$, we can say that $p(x,C)$ lies exactly between $I$ and $J$ iff $m(x,C)$ belongs to $J$. Hence we can say that the class $I$ does not suffer from adversarial examples iff $m(I,C) \subset J$. Similarly, we can say that the class $J$ does not suffer from adversarial examples iff $m(J,C) \subset I$. Since the mirror operation is involutive, we have $m(I,C) \subset J \Rightarrow I \subset m(J,C)$ and $m(J,C) \subset I \Rightarrow J \subset m(I,C)$. Hence:

$$C\ \underline{\text{does not}}\ \text{suffer from adversarial examples} \iff m(I,C) = J \ \text{and}\ m(J,C) = I$$

The non-existence of adversarial examples is equivalent to the classes $I$ and $J$ being mirror classes of each other through $C$, or to the mirror operator $m(\cdot,C)$ defining a bijection between $I$ and $J$. Conversely, we say that a classification boundary $C$ suffers from adversarial examples iff $m(I,C) \neq J$ and $m(J,C) \neq I$. In that case, we call adversarial examples affecting $I$ the elements of $m(I,C)$ that are not in $J$ and we call adversarial examples affecting $J$ the elements of $m(J,C)$ that are not in $I$.

#### 4.2.2 Strength of the adversarial examples affecting a class of images

As discussed before, the magnitude $\epsilon$ of the adversarial perturbations used in the standard procedure is a good measure of how “impressive” or “strong” the adversarial examples are. Unfortunately, this measure is only meaningful for small values of $\epsilon$. We introduce here a measure of strength that is valid on the entire spectrum of the adversarial example phenomenon.

Maximum strength. Let us note $i$ and $j$ the mean images of $I$ and $J$ respectively. For an element $x$ in $I$, the “strength” of the adversarial example $m(x,C)$ is maximised when the distance $\|x - m(x,C)\|$ tends to 0 (this is equivalent to $\epsilon$ tending to 0 in the standard procedure). Averaging over all the elements of $I$, we can say that the strength of the adversarial examples affecting $I$ is maximised when the distance $\|i - m(i,C)\|$ tends to 0 (see figure 6).

Remark that $\|i - m(i,C)\| = 2\,|d(i,C)|$ and consider the projections of the elements in $I$ along the direction $c$: measured from the boundary, their mean value is $d(i,C)$ and we note $\sigma$ their standard deviation. Consider in particular the elements of $I$ that are more than one standard deviation away from the mean, towards the boundary along the direction $c$. If there are no strong outliers in the data, a significant proportion of the elements of $I$ are of this kind, and if the classifier has a good performance, some of them must be correctly classified in $I$, i.e. they must lie on the same side of $C$ as $i$. Hence we must have $\sigma \leq |d(i,C)|$. We can thus write: $\|i - m(i,C)\| \geq 2\sigma$. The strength of the adversarial examples affecting $I$ is maximised ($\|i - m(i,C)\| \to 0$) when there is a direction of very small variance in the data ($\sigma \to 0$) and the boundary lies close to the data along this direction ($d(i,C) \to 0$).

Minimum strength. We call the hyperplane of the nearest centroid classifier the bisecting boundary, and denote it $B$. By definition, $B$ is the unique classification boundary verifying $m(i,B) = j$ (we assume that $i \neq j$ so that $B$ is well-defined). Remark that we have, for a classification boundary $C$:

$$m(I,C) = J \Rightarrow m(i,C) = j \quad \text{but} \quad m(i,B) = j \nRightarrow m(I,B) = J$$

Hence, if there exists a classification boundary $C$ that does not suffer from adversarial examples on $I$ and $J$, then it is unique and equal to $B$; but $B$ can suffer from adversarial examples. In the following, we consider that $B$ minimises the phenomenon of adversarial examples, even when $B$ does suffer from adversarial examples (see figure 7, left). Then, we can say that the strength of the adversarial examples affecting $I$ is minimised when the distance $\|j - m(i,C)\|$ tends to 0 (see figure 7, right).

Based on the previous considerations, and using the arctangent in order to bound the values in the finite interval $[0, \pi/2]$, we formally define the strength of the adversarial examples affecting $I$ through $C$ as:

$$s(I,C) = \arctan\left(\frac{\|j - m(i,C)\|}{\|i - m(i,C)\|}\right)$$

$s(I,C)$ is maximised at $\pi/2$ when $\|i - m(i,C)\| \to 0$ and minimised at 0 when $\|j - m(i,C)\| \to 0$.

#### 4.2.3 The adversarial strength is the deviation angle

In our analysis, the bisecting boundary $B$ of the nearest centroid classifier plays a special role: it minimises the strength of the adversarial examples affecting $I$ and $J$. We note $b$ its normal weight vector (we assume that $\|b\|_2 = 1$) and $b_0$ its bias. Given a classifier $C$ specified by a normal weight vector $c$ and a bias $c_0$, we call deviation angle of $C$ with regards to $B$ the angle $\delta_c$ between $c$ and $b$. More precisely, we can express $c$ as a function of $b$, a unit vector orthogonal to $b$ that we note $b^\perp_c$, and the deviation angle $\delta_c$ as:

$$c = \cos(\delta_c)\, b + \sin(\delta_c)\, b^\perp_c$$

We can then derive (see appendix A) the strengths of the adversarial examples affecting $I$ and $J$ through $C$ in terms of the deviation angle $\delta_c$ and the ratio $r_c$ (with the origin at the midpoint between $i$ and $j$):

$$s(I,C) = \arctan\left(\frac{\sqrt{\sin^2(\delta_c) + r_c^2}}{\cos(\delta_c) + r_c}\right) \quad \text{and} \quad s(J,C) = \arctan\left(\frac{\sqrt{\sin^2(\delta_c) + r_c^2}}{\cos(\delta_c) - r_c}\right)$$

Effect of $r_c$:
If we assume that $C$ separates $i$ and $j$, then we must have $-\cos(\delta_c) < r_c < \cos(\delta_c)$.
When $r_c \to -\cos(\delta_c)$, $s(I,C) \to \pi/2$.
When $r_c \to \cos(\delta_c)$, $s(J,C) \to \pi/2$.
The parameter $r_c$ controls the relative strengths of the adversarial examples affecting $I$ and $J$. It can lead to strong adversarial examples on one class at a time (see figure 8).

In the following, we assume that $r_c \approx 0$, so that:

$$s(I,C) \approx s(J,C) \approx s(C) = \arctan\left(\frac{\sqrt{\sin^2(\delta_c)}}{\cos(\delta_c)}\right) = |\delta_c|$$

In words, when $C$ passes close to the mean of the class centroids ($r_c \approx 0$), the strength of the adversarial examples affecting $I$ is approximately equal to the strength of the adversarial examples affecting $J$ and can be reduced to the deviation angle $|\delta_c|$. In that case we can speak of the adversarial strength without mentioning the class affected: it is minimised for $\delta_c = 0$ (i.e. $C = B$) and maximised when $|\delta_c|$ tends to $\pi/2$.
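The identity between adversarial strength and deviation angle can be verified directly from the definition $s(I,C) = \arctan(\|j - m(i,C)\| / \|i - m(i,C)\|)$. The sketch below is an illustration under assumptions: the mean images are hypothetical, the origin is placed at their midpoint, the bias is set to zero so that the boundary passes through the midpoint, and the score is taken to be $x \cdot c + c_0$.

```python
import numpy as np

def strength(i, j, c, c0):
    """Adversarial strength s(I, C), computed from the mean images i and j."""
    m_i = i - 2.0 * (i @ c + c0) * c      # mirror image of the mean image i through C
    # arctan2 returns pi/2 when the denominator vanishes
    return float(np.arctan2(np.linalg.norm(j - m_i), np.linalg.norm(i - m_i)))

# Hypothetical setup: origin at the midpoint between i and j.
i = np.array([-1.0, 0.0, 0.0])
j = -i
b = (j - i) / np.linalg.norm(j - i)       # weight vector of the nearest centroid classifier
b_perp = np.array([0.0, 1.0, 0.0])        # a unit vector orthogonal to b

for delta in (0.0, 0.3, 1.0, 1.5):
    c = np.cos(delta) * b + np.sin(delta) * b_perp
    print(delta, strength(i, j, c, c0=0.0))   # the two printed values agree
```

With a non-zero bias the two classes are no longer affected equally, which is the effect attributed to $r_c$ above.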

#### 4.2.4 Boundary tilting and its influence on classification

In previous sections, we defined the notion of adversarial strength and showed that it can be reduced to the deviation angle between the weight vector $c$ of the classifier considered and the weight vector $b$ of the nearest centroid classifier. Here, we evaluate the effect on the classification performance of tilting the weight vector $c$ by an angle $\theta$ along an arbitrary direction.

Let $z$ be a unit vector that we call the zenith direction. We can express $c$ as a function of $z$, a unit vector orthogonal to $z$ that we note $z^\perp_c$ and an angle $\theta_c$ that we call the inclination angle of $C$ along $z$:

$$c = \cos(\theta_c)\, z^\perp_c + \sin(\theta_c)\, z$$

We say that we tilt the boundary $C$ along the zenith direction $z$ by an angle $\theta$ when we define a new boundary $C_\theta$ specified by its normal weight vector $c_\theta$ and its bias $c_{\theta 0}$ as follows:

$$c_\theta = \cos(\theta_c + \theta)\, z^\perp_c + \sin(\theta_c + \theta)\, z$$

$$c_{\theta 0} = c_0\, \frac{\cos(\theta_c + \theta)}{\cos(\theta_c)}$$
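The tilting operation can be written as a short function. This is a minimal sketch, assuming that `c` and `z` are unit vectors and that `c` is not parallel to `z` (so that the inclination angle has a non-zero cosine); the function name `tilt` and the example values are hypothetical.

```python
import numpy as np

def tilt(c, c0, z, theta):
    """Tilt the boundary (c, c0) along the unit zenith direction z by the angle theta."""
    sin_tc = float(c @ z)                     # sin(theta_c)
    z_perp = c - sin_tc * z                   # component of c orthogonal to z
    cos_tc = float(np.linalg.norm(z_perp))    # cos(theta_c), assumed non-zero
    z_perp = z_perp / cos_tc
    theta_c = np.arctan2(sin_tc, cos_tc)      # inclination angle of the original boundary
    c_theta = np.cos(theta_c + theta) * z_perp + np.sin(theta_c + theta) * z
    c0_theta = c0 * np.cos(theta_c + theta) / np.cos(theta_c)
    return c_theta, c0_theta

c = np.array([1.0, 0.0, 0.0])                 # hypothetical weight vector (theta_c = 0)
z = np.array([0.0, 0.0, 1.0])                 # hypothetical zenith direction
c_theta, c0_theta = tilt(c, 2.0, z, np.pi / 3)
print(c_theta, c0_theta)
```

The rescaling of the bias keeps the point where the boundary crosses the $z^\perp_c$ axis fixed, so the tilted boundary pivots around it rather than drifting away from the data.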

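Let $S$ be the set of all the images in $I$ and $J$. Abusing the notation, we refer to the sets of all classification scores through $C$ and $C_\theta$ by $d(S,C)$ and $d(S,C_\theta)$. We can show (see appendix B) that:

$$d(S,C) = P \cdot u_{\theta_c} \quad \text{and} \quad d(S,C_\theta) = P \cdot u_{\theta_c+\theta}$$

where $u_{\theta_c}$ and $u_{\theta_c+\theta}$ are the unit vectors rotated by the angles $\theta_c$ and $\theta_c+\theta$ relative to the x-axis and $P$ is the projection of $S$ on the plane $(z^\perp_c, z)$ horizontally translated by $c_0/\cos(\theta_c)$.

Now we define the rate of change between $C$ and $C_\theta$ and note $\mathrm{roc}(\theta)$ the proportion of elements in $S$ that are classified differently by $C$ and $C_\theta$ (i.e. the elements $x$ in $S$ for which $\mathrm{sign}(d(x,C)) \neq \mathrm{sign}(d(x,C_\theta))$). In general, we cannot deduce a closed-form expression of $\mathrm{roc}(\theta)$. However, we can represent it graphically in the plane $(z^\perp_c, z)$ and we see that $\mathrm{roc}(\theta)$ is small as long as the variance of the data in $S$ along the zenith direction $z$ is small and the angle $\theta_c+\theta$ is not too close to $\pi/2$ (see figure 9).

Let us note $v^\perp_z$ and $v_z$ the variances of the data in $S$ along the directions $z^\perp_c$ and $z$ respectively. We present below two situations of interest where $\mathrm{roc}(\theta)$ can be expressed in closed-form.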

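1. When $S$ is flat along the zenith component (i.e. when $v_z$ is null), we have:

$$d(S,C) = \cos(\theta_c)\left(S \cdot z^\perp_c + \frac{c_0}{\cos(\theta_c)}\right) \quad \text{and} \quad d(S,C_\theta) = \cos(\theta_c+\theta)\left(S \cdot z^\perp_c + \frac{c_0}{\cos(\theta_c)}\right)$$

Hence:

$$d(S,C_\theta) = \frac{\cos(\theta_c+\theta)}{\cos(\theta_c)}\, d(S,C)$$

For all $x$ in $S$, the sign of $d(x,C_\theta)$ is equal to the sign of $d(x,C)$: every element of $S$ is classified in the same way by $C$ and $C_\theta$ and $\mathrm{roc}(\theta) = 0$.
When the variance along the zenith direction is null, the classification of the elements in $S$ is unaffected by the tilting of the boundary.

2. When the projection of $S$ on the plane $(z^\perp_c, z)$, horizontally translated by $c_0/\cos(\theta_c)$, follows a bivariate normal distribution centred on the origin with variances $v^\perp_z$ and $v_z$ along its two axes, then we can show (see appendix C) that:

$$\mathrm{roc}(\theta) = \frac{1}{\pi}\left[\arctan\left(\sqrt{\frac{v_z}{v^\perp_z}}\,\tan(x)\right)\right]_{\theta_c}^{\theta_c+\theta}$$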

For instance if and , and the boundaries and are tilted at and respectively along ( and )), then we have .
When the variance along the zenith direction is small enough, the classification of the elements in $S$ is only slightly affected by the tilting of the boundary.
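The closed-form rate of change can be compared with a Monte Carlo estimate. The sketch below rests on several assumptions that are not taken from the text: the data in the tilting plane is sampled from a zero-mean normal distribution with independent coordinates, both inclination angles lie strictly between $-\pi/2$ and $\pi/2$ with $\theta > 0$, and the variances, angles, sample size and function names are arbitrary illustrative choices.

```python
import numpy as np

def roc_closed_form(theta_c, theta, v_z, v_perp):
    """Closed-form rate of change for zero-mean Gaussian data with variances (v_perp, v_z)."""
    def f(x):
        return np.arctan(np.sqrt(v_z / v_perp) * np.tan(x))
    return float((f(theta_c + theta) - f(theta_c)) / np.pi)

def roc_monte_carlo(theta_c, theta, v_z, v_perp, n=200_000, seed=0):
    """Fraction of samples whose score changes sign when the normal rotates by theta."""
    rng = np.random.default_rng(seed)
    # Columns: coordinate along z_perp, coordinate along the zenith direction z.
    p = rng.standard_normal((n, 2)) * np.sqrt([v_perp, v_z])
    u0 = np.array([np.cos(theta_c), np.sin(theta_c)])
    u1 = np.array([np.cos(theta_c + theta), np.sin(theta_c + theta)])
    return float(np.mean(np.sign(p @ u0) != np.sign(p @ u1)))

# Hypothetical values: small variance along the zenith direction, large tilt.
print(roc_closed_form(0.1, 1.2, v_z=0.01, v_perp=1.0))
print(roc_monte_carlo(0.1, 1.2, v_z=0.01, v_perp=1.0))
# Flat case (null variance along z): no sample changes class.
print(roc_monte_carlo(0.1, 1.2, v_z=0.0, v_perp=1.0))
```

The first two values should agree up to sampling noise, and the last one illustrates the first situation of the list, where tilting leaves every classification unchanged.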

#### 4.2.5 Boundary tilting at the origin of strong adversarial examples

Finally, we show that the boundary tilting mechanism can lead to the existence of strong adversarial examples, without affecting the classification performance.

Imagine that we choose the zenith direction $z$ orthogonal to $b$. Then we can express $c$ as a function of $z$, $b$, a unit vector orthogonal to $b$ (and $z$) that we note $y_c$ and an angle $\phi_c$ that we call the azimuth angle of $C$ with regards to $z$ and $b$:

$$c = \cos(\theta_c)\left[\cos(\phi_c)\, b + \sin(\phi_c)\, y_c\right] + \sin(\theta_c)\, z$$

Now, imagine that we tilt the boundary $C$ along the zenith direction $z$ while keeping the azimuth angle $\phi_c$ constant. We can express the weight vector $c_\theta$ of the tilted boundary both as a function of its inclination angle $\theta_c+\theta$ and the azimuth angle $\phi_c$, and as a function of its deviation angle $\delta_c+\delta$:

$$c_\theta = \cos(\theta_c+\theta)\left[\cos(\phi_c)\, b + \sin(\phi_c)\, y_c\right] + \sin(\theta_c+\theta)\, z \quad \text{and} \quad c_\theta = \cos(\delta_c+\delta)\, b + \sin(\delta_c+\delta)\, b^\perp_c$$

We see that the deviation angle $\delta_c+\delta$ of $C_\theta$ depends on the inclination angle $\theta_c+\theta$ and the azimuth angle $\phi_c$:

$$\cos(\delta_c+\delta) = \cos(\theta_c+\theta)\cos(\phi_c)$$

In order for $C_\theta$ to suffer from strong adversarial examples (i.e. $\delta_c+\delta \to \pi/2$), it is sufficient to tilt $C$ along a zenith direction $z$ orthogonal to $b$ (i.e. $\theta_c+\theta \to \pi/2$). If in addition the direction $z$ is such that the variance $v_z$ is small, then the rate of change $\mathrm{roc}(\theta)$ will be small and the classification boundaries $C$ and $C_\theta$ will perform similarly (when $v_z = 0$, $C$ and $C_\theta$ perform exactly in the same way: see figure 10).

For any classification boundary $C$, there always exists a tilted boundary $C_\theta$ such that $C$ and $C_\theta$ perform in the same way (when $v_z = 0$) or almost in the same way (when $v_z$ is small), and $C_\theta$ suffers from adversarial examples of arbitrary strength (as long as there are directions of low variance in the data).
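The relation between deviation, inclination and azimuth angles can be checked numerically. The sketch below is an illustration only: the three orthonormal directions are taken to be the canonical basis of a three-dimensional space, the azimuth value is arbitrary, and the function names are hypothetical.

```python
import numpy as np

def tilted_weight(theta, phi, b, y, z):
    """Unit weight vector with inclination theta along z and azimuth phi in the (b, y) plane."""
    return np.cos(theta) * (np.cos(phi) * b + np.sin(phi) * y) + np.sin(theta) * z

def deviation_angle(c, b):
    """Angle between the unit weight vector c and the unit vector b."""
    return float(np.arccos(np.clip(c @ b, -1.0, 1.0)))

b, y, z = np.eye(3)          # hypothetical orthonormal directions
phi = 0.2                    # azimuth angle, kept constant while tilting
for theta in (0.0, 0.8, 1.5):
    c = tilted_weight(theta, phi, b, y, z)
    # cos(deviation) equals cos(theta) * cos(phi)
    print(theta, deviation_angle(c, b), np.cos(theta) * np.cos(phi))
```

As the inclination approaches $\pi/2$, the deviation angle approaches $\pi/2$ as well, whatever the azimuth.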

#### 4.2.6 Taxonomy of adversarial examples

Given a classifier $C$, we note $\delta_c$ its deviation angle and $er_c$ its error rate on $S$. In the following, we analyse the distribution of all linear classifiers in the deviation angle - error rate diagram. To start with, we consider the nearest centroid classifier $B$ as a baseline and discard all classifiers with an error rate superior to $er_b$ as poorly performing. We also note $er_{min}$ the minimum error rate achievable on $S$. For a given error rate comprised between $er_{min}$ and $er_b$, we say that a classifier is optimal if it minimises the deviation angle. In particular, we call label boundary and we note $L$ the optimal classifier verifying $er_l = er_{min}$. In the deviation angle - error rate diagram, the set of optimal classifiers forms a strictly decreasing curve segment connecting $B$ (minimising the strength of the adversarial examples) to $L$ (minimising the error rate). Any classifier with a deviation angle greater than $\delta_l$ is then necessarily suboptimal: there is always another classifier performing at least as well and suffering from weaker adversarial examples (see figure 11).

Based on these considerations, we propose to define the following taxonomy:

Type 0:

Type 1: adversarial examples affecting the classifiers such that . They affect in particular the optimal classifiers. The inconvenience of their existence is balanced by the performance gains allowed.

Type 2: adversarial examples affecting the classifiers such that . They only affect suboptimal classifiers resulting from the tilting of optimal classifiers along directions of low variance.
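The taxonomy can be read as a simple decision rule on the deviation angle. The sketch below is one possible reading, based on the wording used in the conclusion: a null deviation angle is assumed to correspond to type 0, a deviation angle no greater than that of the label boundary to type 1, and anything greater to type 2. The function name, argument names and the tolerance `tol` are hypothetical.

```python
def adversarial_type(deviation, label_deviation, tol=1e-6):
    """Return 0, 1 or 2 for a classifier with the given deviation angle (radians).

    deviation:       deviation angle of the classifier under study
    label_deviation: deviation angle of the label boundary
                     (the optimal classifier minimising the error rate)
    tol:             hypothetical tolerance under which an angle counts as null
    """
    if deviation < 0 or label_deviation < 0:
        raise ValueError("angles must be non-negative")
    if deviation <= tol:
        return 0  # no tilting away from the nearest centroid boundary
    if deviation <= label_deviation + tol:
        return 1  # tilting that may still be paid for by a lower error rate
    return 2      # tilting beyond the label boundary: necessarily suboptimal

print(adversarial_type(0.0, 0.5), adversarial_type(0.3, 0.5), adversarial_type(1.2, 0.5))
```

Type 0 additionally requires that the two classes are not mirror classes of each other, which this function does not check.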

Let us call training boundary and note the boundary defined by a standard classification method such as SVM or logistic regression. In practice, and are unlikely to be mirror classes of each other through and hence is expected to at least suffer from type 0 adversarial examples. In fact, is also unlikely to minimise the error rate on and if performs better than , then is also expected to suffer from type 1 adversarial examples. Note that there is no restriction in theory on and on some problems, type 1 adversarial examples can be very strong. However, is a priori not expected to suffer from type 2 adversarial examples: why would SVM or logistic regression define a classifier that is suboptimal in such a way? In the following two sections, we show experimentally with SVM that the regularisation level plays a crucial role in controlling the deviation angle of . When the regularisation level is very strong (i.e. when the SVM margin contains all the data), converges towards and the deviation angle is null. When SVM is correctly regularised, is allowed to deviate from sufficiently to converge towards : the optimal classifier minimising the error rate. However when the regularisation level is too low, the inclination of along directions of low variance ends up overfitting the training data, resulting in the existence of strong type 2 adversarial examples.

In light of the mathematical analysis presented in the previous sections, we now return to the toy problem introduced in section 3.2 (see figure 3). Firstly, we can confirm that the boundary defined by SVM satisfies the condition we gave for the non-existence of adversarial examples: the weight vector is equal to the weight vector of the nearest centroid classifier (see figure 12) and we have and . Indeed, mirroring an image that belongs to through changes the colour of its right half image from black to white and results in an image that belongs to (and conversely).

Secondly, we can illustrate the effect of the regularisation level used on the deviation angle (and hence on the adversarial strength). To start with, we modify the toy problem such that (when , overfitting is not likely to happen). We do this by corrupting 5% of the images in and into fully randomised images, such that (half of the corrupted data is necessarily misclassified). Note that on this problem, , hence and is the only optimal classifier. When we perform SVM with regularisation (soft-margin), we obtain a weight vector approximately equal to (see figure 13). The small deviation can be explained by the fact that the training data has been slightly overfitted (the training error is ) and corresponds to very weak adversarial examples. Without regularisation however (hard-margin), the deviation of the weight vector is very strong (see figure 14). In that case, the training data is completely overfitted (the training error is 0%), resulting in the existence of strong type 2 adversarial examples. Interestingly, these adversarial examples possess the same characteristics as the ones observed with GoogLeNet on ImageNet in (goodfellow2014explaining) — the perturbation is barely perceptible, high-frequency and cannot be meaningfully interpreted — even though the classifier is linear.

Finally, we can visualise the boundary tilting mechanism by plotting the projections of the data on the plane , where is the zenith direction along which is tilted (see figure 15). We observe in particular how the overfitting of the corrupted data leads to the existence of the strong type 2 adversarial examples: maximising the minimal separation of the two classes (the margin) results in a very small average separation (making adversarial examples possible). This effect is very reminiscent of the data piling phenomenon studied by marron2007distance and ahn2010maximal on high-dimension low-sample size data.

We now revisit the 3s vs 7s MNIST problem. In particular, we study the effect of varying the regularisation level by performing SVM classification with seven different values for the soft-margin parameter: and . The first remark we can make is that there is a strong, direct correlation between the deviation angle of the weight vector defined by SVM and the regularisation level used (see figure 16, left). When regularisation is high (i.e. when is low), the SVM weight vector is very close to the weight vector of the nearest centroid classifier (). Conversely when regularisation is low (i.e. when is high), the SVM weight vector is almost orthogonal to (). As expected, the error rate on test data is minimised for an intermediate level of regularisation and overfitting happens for low regularisation: for and , the error rate on training data approaches 0% while the error rate on test data increases (see figure 16, right).
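The link between regularisation level and deviation angle can be explored on synthetic data with a few lines of NumPy. The sketch below is not the experiment described above: it swaps SVM for ridge-regularised least squares (chosen only because it has a closed form) and replaces MNIST with hypothetical Gaussian data containing directions of low variance. All names and parameter values are assumptions.

```python
import numpy as np

def deviation_angle_deg(w, b):
    """Angle in degrees between a weight vector w and the centroid direction b."""
    cos_dev = np.dot(w, b) / (np.linalg.norm(w) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos_dev, -1.0, 1.0)))

def ridge_weight(X, y, lam):
    """Ridge least-squares weight vector on centred data (a stand-in for SVM)."""
    Xc = X - X.mean(axis=0)
    d = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 20
stds = np.logspace(0, -2, d)          # from high-variance to low-variance directions
mean_shift = np.zeros(d)
mean_shift[0] = 1.0                   # the classes differ along the first axis only
X_i = rng.normal(size=(n, d)) * stds - mean_shift
X_j = rng.normal(size=(n, d)) * stds + mean_shift
X = np.vstack([X_i, X_j])
y = np.concatenate([-np.ones(n), np.ones(n)])
b = X_j.mean(axis=0) - X_i.mean(axis=0)   # nearest centroid direction

# Strong regularisation (large lam) first, weak regularisation last.
angles = {lam: deviation_angle_deg(ridge_weight(X, y, lam), b)
          for lam in (1e8, 1e2, 1e0, 1e-2, 1e-6)}
for lam, angle in angles.items():
    print(f"lam={lam:g}  deviation angle={angle:.2f} deg")
```

With balanced classes, the ridge solution aligns with the centroid direction as `lam` grows, and it is free to pick up components along low-variance directions as `lam` shrinks, which mirrors the qualitative behaviour reported for the soft-margin parameter.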

When we look at the SVM weight vector for the different levels of regularisation (see figure 17, left), we see that it initially resembles the weight vector of the nearest centroid classifier (), then deviates away into relatively low-frequency directions ( and ) before deviating into higher-frequency directions, resulting in a “random noise aspect”, when the training data starts to be overfitted ( and ).

Let us consider the one-dimensional subspace of generated by , and the 783-dimensional subspace of , orthogonal complement of . We note and the projections of the training set on and respectively and we perform a principal component analysis of , resulting in the 783 principal vectors , …, . Then, we decompose into 27 subspaces of 29 dimensions each, such that is generated by , is generated by , …, and is generated by . For each weight vector , we decompose it into a component in and a component in and we project on each subspace , …, (see figure 17, middle). The norms of the projections of are shown as orange bar charts and the square roots of the total variances in each subspace , …, are shown as blue curves. We see that for and , is dominated by components of high variance, while for and , starts to be more dominated by components of low variance: this result confirms that overfitting happens by the tilting of the boundary along components of low variance. Note that never tilts along flat directions of variation (corresponding to the subspaces ) because for overfitting to take place, there needs to be some variance in the tilting direction. Interestingly, optimal classification seems to happen when each direction is used proportionally to the amount of variance it contains: for , the bar chart follows the blue curve faithfully.

Finally, we can look at the adversarial examples affecting each weight vector (see figure 17, right). In particular, we look at the images of 3s in the test set that are at a median distance from each boundary (median images). We see that the mirror images are closer to their respective original images when the regularisation level is low, resulting in stronger adversarial examples. For , the deviation angle is almost null and we can say that the corresponding adversarial example is of type 0. For and , the increase in deviation angle is associated with an increase in performance and we can say that the corresponding adversarial examples are of type 1. However, for and , the increase in deviation angle only results in overfitting, and we can say that the corresponding adversarial examples are of type 2.
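The subspace decomposition described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the code used for figure 17: the function name, the use of a QR factorisation to obtain a basis of the orthogonal complement, and the random demonstration data are all hypothetical. For 784-dimensional images, `group_size=29` gives the 27 subspaces mentioned in the text.

```python
import numpy as np

def subspace_profile(X, w, b, group_size):
    """Project w on groups of principal directions of X taken orthogonally to b.

    Returns (projection_norms, sqrt_variances), one entry per group, ordered
    from the group of highest variance to the group of lowest variance.
    """
    n, d = X.shape
    if (d - 1) % group_size != 0:
        raise ValueError("group_size must divide d - 1")
    # Complete QR of the single column b: the first column of Q is +/- b / ||b||,
    # the remaining d - 1 columns form an orthonormal basis of its complement.
    Q, _ = np.linalg.qr(b.reshape(-1, 1), mode="complete")
    Q_perp = Q[:, 1:]
    # Principal component analysis of the data projected on the complement of b.
    Y = (X - X.mean(axis=0)) @ Q_perp
    evals, evecs = np.linalg.eigh(Y.T @ Y / n)
    order = np.argsort(evals)[::-1]
    evals = np.clip(evals[order], 0.0, None)
    U = Q_perp @ evecs[:, order]          # principal vectors in the original space
    # Group the coordinates of w (and the variances) by blocks of group_size.
    coeffs = (U.T @ w).reshape(-1, group_size)
    projection_norms = np.linalg.norm(coeffs, axis=1)
    sqrt_variances = np.sqrt(evals.reshape(-1, group_size).sum(axis=1))
    return projection_norms, sqrt_variances

# Hypothetical 7-dimensional data: 6 directions orthogonal to b, in 2 groups of 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2, 0.1])
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
w = rng.normal(size=7)
proj_norms, sqrt_var = subspace_profile(X, w, b, group_size=3)
print(proj_norms, sqrt_var)
```

Plotting `proj_norms` as bars against `sqrt_var` as a curve gives the kind of comparison shown in the middle panel of figure 17.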

These type 2 adversarial examples, like those found on the toy problem, have similar characteristics to the ones affecting GoogLeNet on ImageNet (the adversarial perturbation is barely perceptible and high-frequency). Hence we may hypothesise that the adversarial examples affecting deep networks are also of type 2, originating from a non-linear equivalent of boundary tilting and caused by overfitting. If this hypothesis is correct, then these adversarial examples might also be fixable by using adapted regularisation. Unfortunately, straightforward l2 regularisation only works when the classification method operates on pixel values: as soon as the regularisation term is applied in a feature space that does not directly reflect pixel distance, it no longer prevents the existence of type 2 adversarial examples effectively. We illustrate this by performing linear SVM with soft-margin regularisation after two different standard preprocessing methods: pixelwise normalisation and PCA whitening. In the two cases, the soft-margin parameter is chosen such that the performance is maximised, resulting in a slight boost in performance both for pixelwise normalisation () and for PCA whitening (). Since the preprocessing steps are linear transformations, we can then project the weight vectors obtained back into the original pixel space. We get a deviation angle for the weight vector defined after pixelwise normalisation that is stronger than that of any weight vector defined without preprocessing () and a deviation angle for the weight vector defined after PCA whitening that appears orthogonal to ().

The two weight vectors (see figure 18, left) have a very peculiar aspect: both are strongly dominated by a few pixels, in the periphery of the image for the weight vector defined after pixelwise normalisation and in the top right corner for the weight vector defined after PCA whitening. When we look at the magnitudes of the projections of the components on the subspaces , we see that the dominant pixels correspond to the components where the variance of the data is smallest but non-null (see figure 18, middle). Effectively, the rescaling of the components of very low variance puts a disproportionate weight on them, forcing the boundary to tilt very significantly. The phenomenon is particularly extreme with PCA whitening where, due to numerical approximations, some residual variance was found in components that were not supposed to contain any, and ended up strongly dominating the weight vector (this effect could be avoided by putting a threshold on the minimum variance necessary before rescaling, as is sometimes done in practice).

The resulting adversarial examples are unusual (see figure 18, right). For the pixelwise normalisation preprocessing step, it is possible to change the class of an image by altering the value of pixels that do not affect the digit itself. For the PCA whitening preprocessing step, the perturbation is completely imperceptible: the pixel distance between the original image and the corresponding adversarial example is in the order of . With such a small distance, classification is now very sensitive to any perturbation, whether it is adversarial or random (despite this obvious weakness, this classifier performs very well on normal data).
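To see why rescaling low-variance components inflates their weight, one can project a linear classifier trained on normalised features back to pixel space by hand. The sketch below assumes pixelwise normalisation of the form `(x - mean) / std` with strictly positive `std`; the function name and the three-pixel example values are hypothetical.

```python
import numpy as np

def back_project_pixelwise(w_feat, bias_feat, mean, std):
    """Map a linear classifier defined on (x - mean) / std back to pixel space.

    w_feat . ((x - mean) / std) + bias_feat  ==  w_pix . x + bias_pix
    Assumes every entry of std is strictly positive.
    """
    w_pix = w_feat / std
    bias_pix = bias_feat - np.dot(w_pix, mean)
    return w_pix, bias_pix

# Hypothetical three-pixel example: the last pixel has a tiny but non-null variance.
std = np.array([1.0, 0.5, 1e-3])
mean = np.array([0.2, 0.4, 0.0])
w_feat = np.array([1.0, 1.0, 1.0])   # equal weights in the normalised space
w_pix, bias_pix = back_project_pixelwise(w_feat, 0.0, mean, std)

# The two parameterisations give the same score on any input.
rng = np.random.default_rng(0)
x = rng.normal(size=3) * std + mean
score_feat = np.dot(w_feat, (x - mean) / std)
score_pix = np.dot(w_pix, x) + bias_pix
print(w_pix, score_feat, score_pix)
```

A regulariser that treats the three components of `w_feat` equally therefore leaves the pixel-space weight of the low-variance pixel three orders of magnitude larger than the others, which is the mechanism described above.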

## 5 Conclusion

This paper contributes to the understanding of the adversarial example phenomenon in several different ways. It introduces in particular:

A new perspective.

The phenomenon is captured in one intuitive picture: a submanifold of sampled data, intersected by a class boundary lying close to it, suffers from adversarial examples.

A new formalism.

In linear classification, we proposed a strict condition for the non-existence of adversarial examples. We defined adversarial examples as elements of the mirror class and introduced the notion of adversarial strength. Given a classification boundary , we showed that the adversarial strength can be measured by the deviation angle between and the bisecting boundary of the nearest centroid classifier. We also defined the boundary tilting mechanism, and showed that there always exists a tilted boundary such that and perform in very similar ways, and suffers from adversarial examples of arbitrary strength (as long as there are directions of low variance in the data).

A new taxonomy.

These results led us to define the notion of optimal classifier, minimising the deviation angle for a given error rate. is the optimal classifier minimising the adversarial strength and we called label boundary the optimal classifier minimising the error rate. When and the two classes of images are not mirror classes of each other, we say that suffers from adversarial examples of type 0. When the error rate of is strictly inferior to the error rate of , the deviation angle of is necessarily strictly positive; as long as it stays inferior to the deviation angle of , we say that suffers from adversarial examples of type 1. When the deviation angle of is superior to the deviation angle of , is necessarily suboptimal. In that case we say that suffers from adversarial examples of type 2.

New experimental results.

We introduced a toy problem that does not suffer from adversarial examples, and presented a minimal set of conditions that provoke the appearance of strong type 2 adversarial examples on it. We also showed on the 3s vs 7s MNIST problem that, in practice, the regularisation level plays a key role in controlling the deviation angle, and hence the type of adversarial examples obtained. Type 2 adversarial examples, in particular, can be avoided by using a proper level of regularisation. However, we showed that l2 regularisation only helps when it is applied directly in pixel space.

## Appendix

#### A  Expression of the adversarial strength as a function of the deviation angle

By choosing the origin at the midpoint between and , we can ensure that and . We then have:

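$$\begin{aligned}
\|i - m(i,C)\| &= \|i - i + 2\,d(i,C)\,c\| \\
&= 2\,|d(i,C)| \\
&= 2\,|i\cdot c + c_0| \\
&= 2\,\bigl|\cos(\delta_c)\,(i\cdot b) + \sin(\delta_c)\,\underbrace{(i\cdot b^{\perp}_{c})}_{0} + c_0\bigr| \\
&= 2\,\bigl|-\|i\|\cos(\delta_c) + c_0\bigr|
\end{aligned}$$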

Similarly, we have:

If we assume that lies between and , then we must have and:

By applying the law of cosines in the triangle , we have:

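$$\begin{aligned}
\|j - m(i,C)\| &= 2\sqrt{\|i\|^2\cos^2(\delta_c) + c_0^2 - 2\|i\|\cos(\delta_c)\,c_0 + \|i\|^2 - 2\|i\|^2\cos^2(\delta_c) + 2\|i\|\cos(\delta_c)\,c_0} \\
&= 2\sqrt{\|i\|^2\bigl(1-\cos^2(\delta_c)\bigr) + c_0^2} \\
&= 2\sqrt{\|i\|^2\sin^2(\delta_c) + c_0^2}
\end{aligned}$$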
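The two distances involving the mirror image $m(i,C)$ can be checked numerically. The sketch below assumes a two-dimensional setting with the origin at the midpoint between $i$ and $j$, so that $i = -\|i\|\,b$ and $j = \|i\|\,b$, and a unit weight vector $c = \cos(\delta_c)\,b + \sin(\delta_c)\,b^{\perp}_{c}$. The numerical values and the helper name `mirror` are hypothetical.

```python
import numpy as np

def mirror(x, c, c0):
    """Mirror image of x through the boundary {p : p . c + c0 = 0}, with ||c|| = 1."""
    return x - 2.0 * (np.dot(x, c) + c0) * c

# Hypothetical values for the deviation angle, the offset and the norm of i.
delta_c, c0, norm_i = 0.7, 0.3, 2.0

b = np.array([1.0, 0.0])
b_perp = np.array([0.0, 1.0])
i = -norm_i * b                      # origin at the midpoint between i and j
j = norm_i * b
c = np.cos(delta_c) * b + np.sin(delta_c) * b_perp

m = mirror(i, c, c0)
dist_i = np.linalg.norm(i - m)
dist_j = np.linalg.norm(j - m)

# Closed-form expressions from the derivation.
expected_i = 2.0 * abs(-norm_i * np.cos(delta_c) + c0)
expected_j = 2.0 * np.sqrt(norm_i**2 * np.sin(delta_c)**2 + c0**2)
print(dist_i, expected_i)
print(dist_j, expected_j)
```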

Similarly by applying the law of cosines in the triangle , we have:

Finally by posing , we can write:

#### B  Expression of the sets of all classification scores through C and Cθ

If we regard as a data matrix, then we can write:

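$$\begin{aligned}
d(S,C) &= S\cdot c + c_0 \\
&= S\cdot\bigl(\cos(\theta_c)\,z^{\perp}_{c} + \sin(\theta_c)\,z\bigr) + c_0 \\
&= \cos(\theta_c)\,(S\cdot z^{\perp}_{c}) + \sin(\theta_c)\,(S\cdot z) + c_0 \\
&= \cos(\theta_c)\,\bigl(S\cdot z^{\perp}_{c} + c_0/\cos(\theta_c)\bigr) + \sin(\theta_c)\,(S\cdot z) \\
&= \bigl(\cos(\theta_c),\,\sin(\theta_c)\bigr)\cdot S\cdot\bigl(z^{\perp}_{c} + c_0/\cos(\theta_c),\; z\bigr)^{\top} \\
&= V\cdot P
\end{aligned}$$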

With and .

Similarly we have:

With

#### C  Expression of roc(θ) when P follows a bivariate normal distribution

With covariance :

With covariance :

We have:

We also have:

Hence:

And:

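$$\begin{aligned}
\operatorname{roc}(\theta) &= \operatorname{roc}(Z, C_\theta, \Sigma_2) - \operatorname{roc}(Z, C, \Sigma_2) \\
&= \frac{1}{\pi}\left[\arctan\!\left(\sqrt{\frac{v_z}{v^{\perp}_{z}}}\,\tan(\theta_c+\theta)\right) - \arctan\!\left(\sqrt{\frac{v_z}{v^{\perp}_{z}}}\,\tan(\theta_c)\right)\right] \\
&= \frac{1}{\pi}\left[\arctan\!\left(\sqrt{\frac{v_z}{v^{\perp}_{z}}}\,\tan(x)\right)\right]_{\theta_c}^{\theta_c+\theta}
\end{aligned}$$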
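The closed form for $\operatorname{roc}(\theta)$ can be compared with a Monte Carlo estimate. The sketch below makes several assumptions that should be checked against the full derivation: $P$ is taken to be zero-mean with diagonal covariance $\operatorname{diag}(v^{\perp}_{z}, v_z)$, both boundaries pass through the origin of the plane, and $\theta_c$ and $\theta_c+\theta$ both lie in $(-\pi/2, \pi/2)$. Function names, sample size and parameter values are hypothetical.

```python
import numpy as np

def roc(theta, theta_c, v_z, v_perp):
    """Closed-form rate of change between the boundaries at inclinations theta_c and theta_c + theta."""
    k = np.sqrt(v_z / v_perp)
    upper = np.arctan(k * np.tan(theta_c + theta))
    lower = np.arctan(k * np.tan(theta_c))
    return (upper - lower) / np.pi

def roc_monte_carlo(theta, theta_c, v_z, v_perp, n=200_000, seed=0):
    """Fraction of sampled points whose classification differs between the two boundaries."""
    rng = np.random.default_rng(seed)
    p_perp = rng.normal(scale=np.sqrt(v_perp), size=n)   # coordinate along z_perp
    p_z = rng.normal(scale=np.sqrt(v_z), size=n)         # coordinate along z
    before = np.cos(theta_c) * p_perp + np.sin(theta_c) * p_z
    after = np.cos(theta_c + theta) * p_perp + np.sin(theta_c + theta) * p_z
    return np.mean(np.sign(before) != np.sign(after))

# Hypothetical setting: the zenith direction carries much less variance than z_perp.
theta, theta_c, v_z, v_perp = 0.8, 0.2, 0.25, 4.0
print(roc(theta, theta_c, v_z, v_perp), roc_monte_carlo(theta, theta_c, v_z, v_perp))
```

When `v_z` is zero the closed form returns zero for any tilt, and when the two variances are equal it reduces to `theta / pi`, which matches the intuition that tilting along a direction of low variance changes few classifications.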