# On the Connection Between Adversarial Robustness and Saliency Map Interpretability

Recent studies on the adversarial vulnerability of neural networks have shown that models trained to be more robust to adversarial attacks exhibit more interpretable saliency maps than their non-robust counterparts. We aim to quantify this behavior by considering the alignment between input image and saliency map. We hypothesize that as the distance to the decision boundary grows, so does the alignment. This connection is strictly true in the case of linear models. We confirm these theoretical findings with experiments based on models trained with a local Lipschitz regularization and identify where the non-linear nature of neural networks weakens the relation.

## 1 Introduction

Despite impressive results in a variety of classification tasks (LeCun et al., 2015), even highly accurate neural network classifiers are plagued by a vulnerability to so-called adversarial perturbations (Szegedy et al., 2014). These adversarial perturbations are small, often visually imperceptible perturbations to the network’s input, which however result in the network’s classification decision being changed. Such vulnerabilities may pose a threat to real-world deployments of automated recognition systems, especially in security-critical applications such as autonomous driving or banking. This has sparked a large number of publications related to both the creation of adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016) as well as defenses against these (see (Schott et al., 2018) for an overview). Apart from the application-focused viewpoint, the observed adversarial vulnerability offers non-obvious insights into the inner workings of neural networks. One particular method of defense is adversarial training (Madry et al., 2018), which aims to minimize a modified training objective. While this method – like all known approaches of defense – decreases the accuracy of the classifier, it is also successful in increasing the robustness to adversarial attacks, i.e. the perturbations need to be larger on average in order to change the classification decision.

(Tsipras et al., 2019) also notice that networks that are robustified in this way show interesting phenomena, which so far could not be explained. Neural networks usually exhibit very unstructured saliency maps (gradients of a classifier score with respect to the network’s input (Simonyan et al., 2013)) which barely relate to the input image. On the other hand, saliency maps of robustified classifiers tend to be far more interpretable, in that structures in the input image also emerge in the corresponding saliency map, as exemplified in Figure 1. (Tsipras et al., 2019) describe this as an ’unexpected benefit’ of adversarial robustness. In order to obtain a semantically meaningful visualization of the network’s classification decision in non-robustified networks, the saliency map has to be aggregated over many different points in the vicinity of the input image. This can be achieved either via averaging saliency maps of noisy versions of the image (Smilkov et al., 2017) or by integrating along a path (Sundararajan et al., 2017). Other approaches typically employ modified backpropagation schemes in order to highlight the discriminative portions of the image. Examples of this include guided backpropagation (Springenberg et al., 2015) and deep Taylor decomposition (Montavon et al., 2017).

In this paper, we show that the interpretability of the saliency maps of a robustified neural network is not only a side-effect of adversarial training, but a general property enjoyed by networks with a high degree of robustness to adversarial perturbations. We first demonstrate this principle for the case of a linear, binary classifier and show that the ’interpretability’ is due to the image vector and the respective image gradient aligning. For the more general, non-linear case we empirically show that while this relationship is true on average, the linear theory and the non-linear reality do not always agree. We empirically demonstrate that the more linear the model is, the stronger the connection between robustness and alignment becomes.

## 2 Adversarial Robustness and Saliency Maps

Since adversarial perturbations are small perturbations that change the predicted class of a neural network, it makes sense to define the robustness towards adversarial perturbations via the distance of the unperturbed image to its nearest perturbed image, such that the classification is changed.

###### Definition 1.

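Let $F: X \to C$ (with $C$ finite) be a classifier over the normed vector space $(X, \|\cdot\|)$. We call

$$\rho(x) = \inf_{e \in X} \{ \|e\| : F(x+e) \neq F(x) \} \tag{1}$$

the (adversarial) robustness of $F$ in the point $x$. We call $\mathbb{E}_{x \sim \mathcal{D}}[\rho(x)]$ the (adversarial) robustness of $F$ over the distribution $\mathcal{D}$.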

Put differently, the robustness of a classifier in a point is nothing but the distance to its closest decision boundary. Margin classifiers like support vector machines (Cortes & Vapnik, 1995) seek to keep this distance large for the training set, usually in order to avoid overfitting. (Sokolić et al., 2017) and (Elsayed et al., 2018) also apply this principle to neural networks via regularization schemes. We point out that our definition of adversarial robustness does not depend on the ground truth class label and – given feasible computability – can approximately be calculated even on unlabelled data.

In the following, we will always assume $X$ to be a real, finite-dimensional vector space with the Euclidean norm. The proofs for the following theoretical statements are found in the appendix.

### 2.1 A Motivating Toy Example

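We consider the toy case of a linear binary classifier $F(x) = \operatorname{sgn}(\Psi_z(x))$ with the so-called score function $\Psi_z(x) := \langle x, z \rangle$ and fixed $z \in X$, where $\langle \cdot, \cdot \rangle$ denotes the standard inner product on $X$. A straightforward calculation (see appendix) shows that the adversarial robustness of $F$ is given by

$$\rho(x) = \frac{|\langle x, z \rangle|}{\|z\|} = \frac{|\langle x, \nabla\Psi_z(x) \rangle|}{\|\nabla\Psi_z(x)\|}. \tag{2}$$

Unless stated otherwise, we will always denote with $\nabla$ the gradient with respect to $x$. Note that $\rho(x) = \|x\| \cdot |\cos(\theta)|$, where $\theta$ is the angle between the vectors $x$ and $z$. This implies that, for fixed $\|x\|$, $\rho(x)$ grows with the alignment of $x$ and $z$ and is maximized if and only if $x$ and $z$ are collinear.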
This motivates the following definition.

###### Definition 2 (Alignment).

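Let the binary classifier

$$F: X \to \{-1, 1\}$$

be defined a.e. by $F(x) = \operatorname{sgn}(\Psi(x))$, where $\Psi: X \to \mathbb{R}$ is differentiable in $x$. We then call $\nabla\Psi(x)$ the saliency map of $F$ with respect to $\Psi$ in $x$ and

$$\alpha(x) := \frac{|\langle x, \nabla\Psi(x) \rangle|}{\|\nabla\Psi(x)\|} \tag{3}$$

the alignment with respect to $\Psi$ in $x$.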

The alignment is a measure of how similar the input image and the saliency map are. If $\|x\| = 1$, and $x$ and $\nabla\Psi(x)$ are zero-centered, this coincides with the absolute value of their Pearson correlation. For a linear binary classifier, the alignment trivially increases with the robustness of the classifier.
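The two statements above can be checked numerically. The following is a minimal NumPy sketch, assuming a linear score $\langle x, z \rangle$ with randomly drawn vectors; the helper name `alignment` and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

def alignment(x, grad):
    """Alignment |<x, grad>| / ||grad||, following equation (3)."""
    return abs(np.dot(x, grad)) / np.linalg.norm(grad)

rng = np.random.default_rng(0)
d = 16
z = rng.normal(size=d)   # weight vector of the linear score <x, z> (illustrative)
x = rng.normal(size=d)   # illustrative input

# For the linear score the saliency map is z itself.
alpha = alignment(x, z)

# Smallest perturbation that reaches the decision boundary: remove the
# component of x along z. Its norm is the robustness of equation (2).
e_min = -np.dot(x, z) / np.dot(z, z) * z
rho = np.linalg.norm(e_min)

# Pearson check: zero-centre both vectors and rescale the input to unit norm.
xc = x - x.mean()
xc = xc / np.linalg.norm(xc)
zc = z - z.mean()
pearson = np.corrcoef(xc, zc)[0, 1]
alpha_centred = alignment(xc, zc)
```

Here `alpha` and `rho` agree up to floating-point error, and `alpha_centred` matches `abs(pearson)`.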

Generalizing from the linear to the affine case leads to a classifier of the form $F(x) = \operatorname{sgn}(\langle x, z \rangle + b)$, whose robustness in $x$ is

$$\rho(x) = \frac{|\langle x, z \rangle + b|}{\|z\|}.$$

In this case the robustness and alignment do not coincide anymore. In order to connect these two diverging concepts, we offer two alternative viewpoints. On the one hand, we can trivially bound the robustness via the triangle inequality

$$\rho(x) \le \alpha(x) + \frac{|b|}{\|z\|}. \tag{4}$$

This is particularly meaningful if $|b| / \|z\|$ is small in comparison to $\alpha(x)$. Alternatively, one can connect the robustness to the alignment at a different point $\xi = x + \frac{b}{\|z\|^2} z$, leading to the relation

$$\rho(x) = \alpha(\xi). \tag{5}$$

In the affine case this approach simply amounts to a shift of the data that is uniform over all data points $x$. We will see how these two viewpoints lead to different bounds in the non-linear case later.
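As a short worked check of relation (5), assume the shifted point is $\xi = x + \frac{b}{\|z\|^2} z$ (the shift for which the identity holds) and recall that the saliency map of the affine score $\langle x, z \rangle + b$ is $z$ at every point:

```latex
\begin{align*}
\alpha(\xi)
  = \frac{|\langle \xi, z \rangle|}{\|z\|}
  = \frac{\bigl|\langle x, z \rangle + \tfrac{b}{\|z\|^{2}} \langle z, z \rangle\bigr|}{\|z\|}
  = \frac{|\langle x, z \rangle + b|}{\|z\|}
  = \rho(x).
\end{align*}
```

For an illustrative instance, take $x = (1, 5)$, $z = (1, 0)$ and $b = 2$. Then $\rho(x) = 3$ and $\alpha(x) = 1$, so bound (4) reads $3 \le 1 + 2$ and is tight, while $\xi = (3, 5)$ gives $\alpha(\xi) = 3 = \rho(x)$.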

### 2.2 The General Case

We now consider the general, $n$-class case.

###### Definition 3 (Alignment, Multi-Class Case).

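Let

$$\Psi = (\Psi_1, \dots, \Psi_n): X \to \mathbb{R}^n$$

be differentiable in $x$. Then for an $n$-class classifier $F$ defined a.e. by

$$F(x) = \operatorname*{arg\,max}_i \Psi_i(x), \tag{6}$$

we call $\nabla\Psi_{F(x)}(x)$ the saliency map of $F$. We further call

$$\alpha(x) := \frac{|\langle x, \nabla\Psi_{F(x)}(x) \rangle|}{\|\nabla\Psi_{F(x)}(x)\|} \tag{7}$$

the alignment with respect to $\Psi$ in $x$.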

#### 2.2.1 Linearized Robustness

In general the distance to the decision boundary $\rho(x)$ can be infeasible to compute. However, for classifiers built on locally affine score functions – such as most neural networks using ReLU or leaky ReLU activations – $\rho(x)$ can easily be computed, provided the locally affine region is sufficiently large. To quantify this, define the radius of the locally affine component of $\Psi$ around $x$ as

$$l(x) = \sup \{ r \mid \forall i: \Psi_i \text{ affine in } B_r(x) \},$$

where $B_r(x)$ is the open ball of radius $r$ around $x$ with respect to the Euclidean metric.

###### Lemma 1.

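Let $F$ be a classifier with locally affine score function $\Psi$. Assume $\rho(x) \le l(x)$. Then

$$\rho(x) = \min_{j \neq i^*} \frac{\Psi_{i^*}(x) - \Psi_j(x)}{\|\nabla\Psi_{i^*}(x) - \nabla\Psi_j(x)\|}, \tag{8}$$

for $i^* = F(x)$ the predicted class at $x$.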

Similar identities were previously also independently derived in (Elsayed et al., 2018) and (Jakubovitz & Giryes, 2018).

Note that while nearly all state-of-the-art classification networks are piecewise affine, the condition of Lemma 1 is typically violated in practice. However, the lemma can still hold approximately as long as the linear approximation to the network’s score functions is sufficiently good in the relevant neighbourhood of $x$. This motivates the definition of the linearized (adversarial) robustness $\tilde\rho$.

###### Definition 4 (Linearized Robustness).

Let $\Psi(x)$ be the differentiable score vector for the classifier $F$ in $x$. We call

$$\tilde\rho(x) := \min_{j \neq i^*} \frac{\Psi_{i^*}(x) - \Psi_j(x)}{\|\nabla\Psi_{i^*}(x) - \nabla\Psi_j(x)\|}, \tag{9}$$

the linearized robustness in $x$, where $i^* = F(x)$ is the predicted class at point $x$.

We later show that the two notions lead to very similar results, even if the condition of Lemma 1 is violated.
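The linearized robustness of equation (9) only needs the score vector and the Jacobian of the scores with respect to the input. A minimal NumPy sketch follows, assuming both are already available as arrays; the function name `linearized_robustness` and the small linear example are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def linearized_robustness(scores, jacobian):
    """Linearized robustness, equation (9).

    scores:   shape (n,), score vector Psi(x)
    jacobian: shape (n, d), row i is the gradient of Psi_i with respect to x
    Assumes the gradients of distinct classes differ (no division by zero).
    """
    i_star = int(np.argmax(scores))
    best = np.inf
    for j in range(len(scores)):
        if j == i_star:
            continue
        gap = scores[i_star] - scores[j]
        grad_norm = np.linalg.norm(jacobian[i_star] - jacobian[j])
        best = min(best, gap / grad_norm)
    return best

# Illustrative linear 3-class model with scores W @ x, whose Jacobian is W.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 1.0])
rho_lin = linearized_robustness(W @ x, W)
```

In this example the scores are $(2, 1, -3)$, the closest competing class is the second one, and `rho_lin` equals $1/\sqrt{2}$, which for a linear model is the exact distance to the decision boundary.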

#### 2.2.2 Reducing the Multi-Class Case

In this section, we introduce a toolset which helps bridge the gap between the alignment and the linearized robustness of a multi-class classifier. In the following, for fixed $x$, let $i^* = F(x)$ and let $j^*$ be the minimizer in (9). We can assign to $F$ in $x$ a binarized classifier $F^\dagger_x$ with

$$F^\dagger_x(y) := \operatorname{sgn}(\Psi^\dagger_x(y)), \tag{10}$$

where $\Psi^\dagger_x(y) := \Psi_{i^*}(y) - \Psi_{j^*}(y)$. Its linearized robustness in $x$ is the same as for $F$. The binarized saliency map $\nabla\Psi^\dagger_x(x)$, and the respective alignment,

$$\alpha^\dagger(x) = \frac{|\langle x, \nabla(\Psi_{i^*} - \Psi_{j^*})(x) \rangle|}{\|\nabla(\Psi_{i^*} - \Psi_{j^*})(x)\|}, \tag{11}$$

which we call binarized alignment, offer an alternative, natural perspective on the above considerations. This is because for classifiers as defined in (6), the actual score values do not necessarily carry any information about the classification decision, whereas the score differences do. While, roughly speaking, $\nabla\Psi_{i^*}(x)$ tells us what $F$ ’thinks’ makes $x$ a member of its predicted class, $\nabla\Psi^\dagger_x(x)$ carries information about what sets $x$ apart from its closest neighboring class (according to linearization).

In the special case of a linear, multi-class classifier, we have

$$\rho(x) = \tilde\rho(x) = \alpha^\dagger(x)$$

and in the linear, binary case even

$$\alpha(x) = \alpha^\dagger(x).$$
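The identity between linearized robustness and binarized alignment for linear multi-class classifiers can be illustrated with a few lines of NumPy. This is a sketch under the assumption of a bias-free linear model with random weights; the function name and its return convention are hypothetical.

```python
import numpy as np

def binarized_alignment_and_robustness(x, scores, jacobian):
    """Return (binarized alignment, linearized robustness) at x.

    The competing class j* is the minimizer of the ratio in equation (9);
    the binarized saliency map is the gradient difference of classes i* and j*.
    """
    i_star = int(np.argmax(scores))
    others = [j for j in range(len(scores)) if j != i_star]
    ratios = [
        (scores[i_star] - scores[j]) / np.linalg.norm(jacobian[i_star] - jacobian[j])
        for j in others
    ]
    k = int(np.argmin(ratios))
    g_dagger = jacobian[i_star] - jacobian[others[k]]
    alpha_dagger = abs(np.dot(x, g_dagger)) / np.linalg.norm(g_dagger)
    return alpha_dagger, ratios[k]

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 8))   # illustrative linear 5-class model, scores W @ x
x = rng.normal(size=8)
alpha_dagger, rho_tilde = binarized_alignment_and_robustness(x, W @ x, W)
```

For a linear model the score difference is exactly the inner product of the input with the gradient difference, so the two returned values coincide up to rounding.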

## 3 Decompositions and Bounds for Neural Networks

### 3.1 Homogeneous Decomposition

In the previous section we have seen that in the case of binary classifiers, the robustness and binarized alignment coincide for linear score functions. However, requiring $\Psi$ to be linear is a stronger assumption than necessary to deduce the result: it is in fact sufficient for $\Psi$ to be positive one-homogeneous. Any such function satisfies $\Psi(ax) = a\Psi(x)$ for all $a > 0$ and $x \in X$.

###### Lemma 2 (Linearized Robustness of Homogeneous Classifiers).

Consider a classifier with positive one-homogeneous score functions. Then

$$\tilde\rho(x) = \alpha^\dagger(x). \tag{12}$$

In particular, most feedforward neural networks with (leaky) ReLU activations without biases are positive one-homogeneous. This observation motivates splitting up any classifier built on neural networks into a homogeneous term and the corresponding remainder, leading to the following decomposition result.

###### Theorem 1 (Homogeneous Decomposition of Neural Networks).

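Let $\Psi^i_{\Theta,b}$ be any logit of a neural network with ReLU activations (of the class given in Definition 5 in the appendix). Denote by $\Theta$ the linear filters and by $b$ the bias terms of the network. Then

$$\Psi^i_{\Theta,b}(x) = \langle x, \nabla_x \Psi^i_{\Theta,b}(x) \rangle + \langle b, \nabla_b \Psi^i_{\Theta,b}(x) \rangle = \langle x, \nabla_x \Psi^i_{\Theta,b}(x) \rangle + \sum_k b_k \, \partial_{b_k} \Psi^i_{\Theta,b}(x). \tag{13}$$

Note that the above vector $b$ includes the running averages of the means for batch normalization. For ReLU networks, the remainder term $\langle b, \nabla_b \Psi^i_{\Theta,b}(x) \rangle$ is locally constant, because it changes only when $x$ enters another locally linear region. For ease of notation, we will now drop the subscripts $\Theta$ and $b$.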
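The decomposition (13) can be verified by hand on a small network. The sketch below assumes a two-layer ReLU network with random weights and computes the input and bias gradients analytically; the architecture, sizes and variable names are illustrative only and unrelated to the models used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 6, 10, 3                      # input, hidden and output sizes (illustrative)
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(n, h)), rng.normal(size=n)
x = rng.normal(size=d)

def logits(x):
    """Two-layer ReLU network: W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

i = 0                                              # logit under consideration
mask = (W1 @ x + b1 > 0).astype(float)             # active ReLU units at x

grad_x = W1.T @ (mask * W2[i])                     # gradient of logit i w.r.t. x
grad_b1 = mask * W2[i]                             # gradient w.r.t. first-layer bias
grad_b2 = np.eye(n)[i]                             # gradient w.r.t. output bias

homogeneous = x @ grad_x                           # <x, grad_x Psi_i(x)>
remainder = b1 @ grad_b1 + b2 @ grad_b2            # <b, grad_b Psi_i(x)>
```

The sum `homogeneous + remainder` reproduces `logits(x)[i]`, and `remainder` depends on `x` only through the activation pattern `mask`, which is why it is locally constant.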

### 3.2 Pointwise Bounds

In section 2.1, we introduced two different viewpoints for affine, binary classifiers which connect the robustness to the alignment. In a similar vein to inequality (4) and equality (5), upper bounds to the linearized robustness depending on the alignment can be given for neural networks. In the following, we will write $\bar v$ for $v / \|v\|$ with $v \neq 0$. Again, in the following we fix $x$ and write $i^* = F(x)$ and $j^*$ for the minimizer in $j$ from equation (9).

###### Theorem 2.

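Let $g := \nabla\Psi_{i^*}(x)$. Furthermore, let $g^\dagger := \nabla\Psi_{i^*}(x) - \nabla\Psi_{j^*}(x)$ and $\beta^\dagger := \beta_{i^*}(x) - \beta_{j^*}(x)$, where $\beta_i(x) := \langle b, \nabla_b \Psi_i(x) \rangle$ denotes the remainder term from (13). Then

$$\begin{aligned}
\tilde\rho(x) &\le \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|} && \text{(14)} \\
&\le \alpha(x) + \|x\| \cdot \|\bar g^\dagger - \bar g\| + \frac{|\beta^\dagger|}{\|g^\dagger\|}. && \text{(15)}
\end{aligned}$$

Distances on the unit sphere (such as $\|\bar g^\dagger - \bar g\|$) can be converted to angles through the law of cosines. For the above inequalities to be reasonably tight, the angle between $g^\dagger$ and $g$ needs to be small and $|\beta^\dagger|$ needs to be small in comparison to $\|g^\dagger\|$. In this case, the alignment should roughly increase with the linearized robustness.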
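The two bounds can be evaluated term by term on a toy model. The following NumPy sketch assumes a random two-layer ReLU network and uses the notation of the proofs in the appendix (gradient difference `g_dag`, remainder difference `beta_dag`); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 6, 10, 4
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(n, h)), rng.normal(size=n)
x = rng.normal(size=d)

pre = W1 @ x + b1
mask = (pre > 0).astype(float)
scores = W2 @ (mask * pre) + b2
jac = (W2 * mask) @ W1          # row i: gradient of logit i with respect to x
beta = scores - jac @ x         # locally constant remainder of each logit

i_star = int(np.argmax(scores))
others = [j for j in range(n) if j != i_star]
ratios = [
    (scores[i_star] - scores[j]) / np.linalg.norm(jac[i_star] - jac[j])
    for j in others
]
j_star = others[int(np.argmin(ratios))]
rho_tilde = min(ratios)         # linearized robustness, equation (9)

def unit(v):
    """Normalize a non-zero vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

g = jac[i_star]
g_dag = jac[i_star] - jac[j_star]
beta_dag = beta[i_star] - beta[j_star]

alpha = abs(x @ unit(g))                     # alignment, equation (7)
alpha_dag = abs(x @ unit(g_dag))             # binarized alignment, equation (11)
offset = abs(beta_dag) / np.linalg.norm(g_dag)

bound_14 = alpha_dag + offset
bound_15 = alpha + np.linalg.norm(x) * np.linalg.norm(unit(g_dag) - unit(g)) + offset
```

Comparing `rho_tilde`, `bound_14` and `bound_15` for different inputs shows how much slack each error term contributes; the chain `rho_tilde <= bound_14 <= bound_15` holds for every input.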

###### Theorem 3.

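Let $\xi := x + \frac{\beta^\dagger}{\|g^\dagger\|^2} g^\dagger$ and let $\gamma := \nabla\Psi_{F(\xi)}(\xi)$ be the saliency map in $\xi$, with $g^\dagger$ and $\beta^\dagger$ defined as in the previous theorem. Then

$$\tilde\rho(x) \le \frac{|\langle \xi, \gamma \rangle|}{\|\gamma\|} + \|\xi\| \cdot \|\bar g^\dagger - \bar\gamma\|, \tag{16}$$

$$\tilde\rho(x) \le \alpha(\xi) + \|\xi\| \cdot \|\bar g^\dagger - \bar\gamma\|.$$

Depending on the sign of $\beta^\dagger$, the shifted image $\xi$ can either be understood as a gradient ascent or descent iterate for maximizing/minimizing $\Psi^\dagger_x$. This theorem assimilates $\beta^\dagger$ into $\xi$, providing an upper bound to $\tilde\rho(x)$ that depends on $\alpha(\xi)$. The sensibility of this hinges on $\xi$ being reasonably close to $x$ and $\gamma$ having a low angle with $g^\dagger$.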

If the error terms in inequalities (14), (15) and (16) are small, these inequalities thus provide a simple illustration why more robust networks yield more interpretable saliency maps.

Nevertheless, the right-hand side may be much larger than $\tilde\rho(x)$ if an image and its respective saliency map are almost orthogonal. This is because the Cauchy-Schwarz inequality (see the proofs in the appendix) provides a large upper bound in this case. The inequalities rather serve as an explanation of how the various terms of alignment may deviate from the linearized robustness in the case of a neural network.

### 3.3 Alignment and Interpretability

The above considerations demonstrate how an increase in robustness may induce an increase in the alignment between an input image and its respective saliency map. The initial observation – which was previously described as an increase in interpretability – may thus be ascribed to this phenomenon. This is especially true in the case of natural images, as exemplified in Figure 1. There, what a human observer would deem an increase in interpretability, expresses itself as discriminative portions of the original image reappearing in the saliency map, which naturally implies a stronger alignment. The concepts of alignment and interpretability should however not be conflated completely: In the case of quasi-binary image data like MNIST, 0-regions of the image render the inner product in equation (7) invariant with respect to the saliency map in this region, even if the saliency map e.g. assigns relevance to the absence of a feature in this region. Note however that the saliency map in this region still influences the alignment term through the division by its norm. Additionally, the alignment is also not invariant to the images’ representation (color space, shifts, normalization etc.). Still, for most types of image data an increase in alignment in discriminative regions should coincide with an increase in interpretability.

## 4 Experiments

In order to validate our hypothesis, we trained several models of different adversarial robustness on both MNIST (LeCun et al., 1990) and ImageNet (Deng et al., 2009) using double backpropagation (Drucker & Le Cun, 1992). For a neural network $f_\theta$ with a softmax output layer, this amounts to minimizing the modified loss

$$\frac{1}{N} \sum_{i=1}^{N} \left[ L\bigl(f_\theta(x^{(i)}), y^{(i)}\bigr) + \lambda \cdot \bigl\| \nabla L\bigl(f_\theta(x^{(i)}), y^{(i)}\bigr) \bigr\|^2 \right] \tag{17}$$

over the parameters $\theta$. Here, $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ is the training set and $L$ denotes the negative log-likelihood error function. The hyperparameter $\lambda$ determines the strength of the regularization. Note that this penalizes the local Lipschitz constant of the loss. As (Simon-Gabriel et al., 2018) demonstrate, double backpropagation makes neural networks more resilient to adversarial attacks. By varying $\lambda$, we can easily create models of different adversarial robustness for the same dataset, whose properties we can then compare. (Anil et al., 2018) previously noted that Lipschitz constrained networks exhibit interpretable saliency maps (without an explanation), which can be regarded as a side-effect of the increase in adversarial robustness.
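To make the structure of objective (17) concrete, the sketch below evaluates it for a linear softmax model, where the input gradient of the negative log-likelihood has a closed form. This is an illustrative stand-in only: the paper's models are deep networks trained in TensorFlow, and the function names, shapes and random data here are assumptions.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_and_input_grad(W, b, X, y):
    """Per-sample negative log-likelihood and its gradient w.r.t. the input.

    Model: f(x) = softmax(W x + b); for this model the input gradient of the
    loss is W^T (p - onehot(y)).
    """
    probs = softmax(X @ W.T + b)                      # shape (N, n)
    nll = -np.log(probs[np.arange(len(y)), y])        # shape (N,)
    grad_x = (probs - np.eye(W.shape[0])[y]) @ W      # shape (N, d)
    return nll, grad_x

def double_backprop_objective(W, b, X, y, lam):
    """Mean of loss plus lam times the squared norm of its input gradient."""
    nll, grad_x = nll_and_input_grad(W, b, X, y)
    return np.mean(nll + lam * np.sum(grad_x ** 2, axis=1))

rng = np.random.default_rng(0)
N, d, n = 8, 5, 3
X = rng.normal(size=(N, d))
y = rng.integers(0, n, size=N)
W, b = rng.normal(size=(n, d)), rng.normal(size=n)

plain = double_backprop_objective(W, b, X, y, lam=0.0)
regularized = double_backprop_objective(W, b, X, y, lam=0.1)
```

For deep networks the input gradient is obtained by backpropagation, and minimizing the penalty requires differentiating through that gradient, which is where the name double backpropagation comes from.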

For the MNIST experiments, we trained each of our 16 models on an NVIDIA 1080Ti GPU with a batch size of 100 for 200 epochs, covering the regularization hyperparameter range from 10 to 180,000, before the models start to degenerate. The architecture used is described in the appendix.

For the experiments on ImageNet, we fine-tuned the pre-trained ResNet50 model from (He et al., 2016) over 35 epochs on 2 NVIDIA P100 GPUs with a total batch size of 32. We used stochastic gradient descent with a learning rate of 0.0001 and momentum of 0.99. The learning rate was divided by 10 whenever the error stopped improving. For the regularization parameter, we chose

. The experiments were implemented in Tensorflow

### 4.1 Robustness and Alignment

For checking the relation between the alignment and robustness of a neural network, we created 1000 adversarial examples per model on the respective validation set. This was realized using the Python library Foolbox (Rauber et al., 2017), which offers pre-defined adversarial attacks, three of which we used in this paper: The GradientAttack performs a line search for the closest adversarial example along the direction of the loss gradient. L2BasicIterativeAttack implements the projected gradient descent attack from (Kurakin et al., 2016) for the Euclidean metric. Similarly, CarliniWagnerL2Attack (CW-attack) is the attack introduced in (Carlini & Wagner, 2017) suited for finding the closest adversarial example in Euclidean metric. Additionally, we calculated the linearized robustness $\tilde\rho$, which entails calculating $n$ gradients per image for an $n$-class problem.

In Figures 2 and 3, we investigate how the median alignment depends on the medians of the different conceptions of robustness. We opted in favor of the median instead of the arithmetic mean due to its increased robustness to outliers, which occurred especially when using the gradient attack. In the case of ImageNet (Figure 2), an increase in median alignment with the median robustness is clearly visible for all three estimates of the robustness. On the other hand, the alignment for the MNIST data increases with the robustness as well, but seems to saturate at some point. We will offer an explanation for this phenomenon later.

We now consider the pointwise connection between robustness and alignment. In Figure 4 the two variables are highly correlated for a model trained on MNIST, pointing towards the fact that the network behaves very similarly to a positive one-homogeneous function. There is however no visible correlation between them on the ImageNet model, which is a consistent behavior throughout the whole experiment cohort. We will later analyse the source of this behavior. The increase in median alignment for ImageNet can still be explained by a statistical argument: If the signed quantity $\langle x, \nabla\Psi_{F(x)}(x) \rangle / \|\nabla\Psi_{F(x)}(x)\|$ has median zero, as is approximately true in our ImageNet model, then the median alignment is the median absolute deviation of this quantity. In other words, the graph for ImageNet in Figure 4 depicts the dispersion of this signed quantity. The above observations also hold well for the binarized alignment.
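The statistical argument rests on a simple identity: if a signed quantity has median zero, the median of its absolute value equals its median absolute deviation. A minimal NumPy sketch, using a synthetic symmetric sample as a stand-in for the signed inner products (the data are illustrative, not experimental values):

```python
import numpy as np

rng = np.random.default_rng(0)
half = rng.normal(size=500)
signed = np.concatenate([half, -half])   # symmetric sample, so its median is zero

median_signed = np.median(signed)
median_abs = np.median(np.abs(signed))                 # plays the role of the median alignment
mad = np.median(np.abs(signed - median_signed))        # median absolute deviation
```

With `median_signed` equal to zero, `median_abs` and `mad` are the same number, so a growing median alignment reflects a growing spread of the signed quantity.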

In Figure 5 a tight correlation between $\tilde\rho$ and $\rho$ becomes evident. Here, the latter has been calculated using the CW-attack. The linearized robustness model is hence an adequate approximation of the actual robustness $\rho$, even for the highly non-linear neural network models used on ImageNet. Finally note that all used attacks lead to the same general behavior of all quantities investigated (see Figures 2 and 3).

### 4.2 Explaining the Observations

In the last section, we observed some commonalities between the experiments on ImageNet and MNIST, but also some very different behaviors. In particular, two aspects stand out: Why does the median alignment steadily increase for the observed ImageNet experiments, whereas on MNIST this stagnates at some point (Figures 2 and 3)? Furthermore, why are robustness and alignment so highly correlated on MNIST but almost uncorrelated on ImageNet (Figure 4)? We turn to Theorems 2 and 3 for answers.

Theorem 2 states that

$$\tilde\rho(x) \le \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|}, \tag{18}$$

where $\beta^\dagger$ is the locally constant term, $g^\dagger$ is the saliency map of the binarized classifier and $\bar v = v / \|v\|$ for $v \neq 0$. In Figure 6, we check how strongly the right-hand side of inequality (18) is dominated by $\alpha^\dagger(x)$, i.e. how large the influence of the locally linear term is in comparison to the locally constant term. For ImageNet, this ratio increases from below 0.55 to almost 0.85, pointing towards a model increasingly governed by its linearized part. On MNIST, this ratio strongly decreases over the robustness’s range. Note however that in the weakly regularized MNIST models, the right-hand side is extremely dominated by the median alignment in the first place.

A similar analysis can be performed for the second inequality from Theorem 2,

$$\tilde\rho(x) \le \alpha(x) + \|x\| \cdot \|\bar g^\dagger - \bar g\| + \frac{|\beta^\dagger|}{\|g^\dagger\|}, \tag{19}$$

which additionally makes a step from binarized alignment to (conventional) alignment.

This leads to an additional error term, making the bound significantly less tight than in the previous case. In particular, the proportion of the alignment on the right-hand side diminishes, confirming our prediction from section 3.2. Nevertheless, the qualitative behavior is similar to the previous case, with the alignment taking up an increasing fraction of the right-hand side with increasing robustness. For MNIST data, the ratio varies little compared to the ratio from the last inequality. This indicates that the remainder term $\|x\| \cdot \|\bar g^\dagger - \bar g\|$ does not change too strongly over the set of MNIST experiments compared to $|\beta^\dagger| / \|g^\dagger\|$. We thus deduce that the qualitative relationship between robustness and alignment is fully governed by the error term introduced in (18), i.e. the locally constant term of the logit.

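We now do the same for the inequality in Theorem 3, which states that

$$\tilde\rho(x) \le \frac{|\langle \xi, \gamma \rangle|}{\|\gamma\|} + \|\xi\| \cdot \|\bar g^\dagger - \bar\gamma\| \tag{20}$$

for $\xi$ and $\gamma$ as defined in Theorem 3, which gets rid of the additive term $|\beta^\dagger| / \|g^\dagger\|$ from (18). Again, in the case of ImageNet the alignment term $|\langle \xi, \gamma \rangle| / \|\gamma\|$ grows more quickly in comparison to $\|\bar g^\dagger - \bar\gamma\|$, the distance of the normalized gradients, whereas their ratio is approximately constant for MNIST data.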

To conclude, we have seen that the upper bounds from Theorems 2 and 3 provide valuable information in which ways both the experiments on ImageNet and MNIST are influenced by the respective terms. In the case of ImageNet, we consistently see the alignment terms growing more quickly than the other terms. This might indicate that the growth in alignment stems not only from the growth in the robustness alone, but also from the model becoming increasingly similar to our idealized toy example. In other words, not only does the robustness make the alignment grow, but the connection between these two properties becomes stronger in the case of ImageNet. This is in agreement with the seemingly superlinear growth of the median alignment in Figure 2.
It is not surprising that a classifier for a problem as complex as ImageNet is highly non-linear, which makes the (pointwise) connection between alignment and robustness rather loose. We hence conjecture that the imposed regularization increasingly restricts the models to be more linear, thereby making them more similar to our initial toy example.
For MNIST, the regularization seems to have the opposite effect: As seen in Figure 6, the binarized alignment initially dwarfs the correction term introduced by the locally constant portion of the binarized logit $\Psi^\dagger_x$. As the network becomes more robust, $\Psi^\dagger_x$ is apparently not dominated by the linear terms anymore, while the influence of the locally constant terms (i.e. $\beta^\dagger$) increases. This hypothesis seems sensible, considering MNIST is a very simple problem which we tackled with a comparatively shallow network. This can be expected to yield a model with a low degree of non-linearity. The penalization of the local Lipschitz constant here seems to have the effect of requiring larger locally constant terms $\beta^\dagger$, in contrast to the models trained on ImageNet.

We check the validity of these claims by tracking the median size of against the median size of in Figure 9. On MNIST, starts out at approximately of and at the end rises to almost . Note that this does not indicate that is typically close to 0 for all , just that is, compared to .
On MNIST, this ratio is close to 1 up until , when it suddenly and quickly falls below . This drop is consistent with what we see in Figure 3: At around the same point this drop occurs, the alignment starts to saturate. While an increase in the model’s median robustness should imply an increase in the model’s median alignment, the deviation from linearity weakens the connection between robustness and alignment, such that the two effects roughly cancel out.

In Figure 10, we provide examples for the different gradient concepts we introduced in Theorems 2 and 3, both for the most robust and non-robust network from our experiment cohort.

## 5 Conclusion and Outlook

In this paper, we investigated the connection between a neural network’s robustness to adversarial attacks and the interpretability of the resulting saliency maps. Motivated by the binary, linear case, we defined the alignment as a measure of how much a saliency map matches its respective image. We hypothesized that the perceived increase in interpretability is due to a higher alignment and tested this hypothesis on models trained on MNIST and ImageNet. While on average, the proposed relation holds well, the connection is much less pronounced for individual points, especially on ImageNet. Using some upper bounds for the robustness of a neural network, which we derived using a decomposition theorem, we arrived at the conclusion that the strength of this connection is strongly linked with how similar to a linear model the neural network is locally. As ImageNet is a comparatively complex problem, any sufficiently accurate model is bound to be very non-linear, which explains the difference to MNIST.
While this paper shows the general link between robustness and alignment, there are still some open questions. Since we only used one specific robustification method, further experiments should determine the influence of this method. One could explore whether a different choice of norm leads to different observations. Another future direction of research could be to investigate the degree of (non-)linearity and its connection to this topic. While Theorems 2 and 3 illustrate how the pointwise linearized robustness and alignment may diverge, depending on the error terms in the respective bounds, a more in-depth look should focus on why and when these terms have a certain relationship to each other.

From a methodological standpoint, the discovered connection may also serve as an inspiration for new adversarial defenses, where not only the robustness but also the alignment is taken into account. One way of increasing the alignment directly would be through the penalty term

$$\lambda \left( \|x\|^2 \, \|\nabla\Psi_i(x)\|^2 - \langle x, \nabla\Psi_i(x) \rangle^2 \right),$$

which is bounded from below by 0 via the Cauchy-Schwarz inequality. Any robustifying effects of the increased alignment may however be confounded with the Lipschitz-penalty that the first summand effectively introduces, which necessitates a careful experimental evaluation.
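The proposed penalty is straightforward to evaluate once the saliency map is available. A minimal NumPy sketch, assuming the gradient has already been computed by some other means; the function name `alignment_penalty` and the example vectors are illustrative.

```python
import numpy as np

def alignment_penalty(x, grad, lam=1.0):
    """lam * (||x||^2 ||grad||^2 - <x, grad>^2).

    Non-negative by the Cauchy-Schwarz inequality and zero exactly when
    x and grad are collinear.
    """
    return lam * (np.dot(x, x) * np.dot(grad, grad) - np.dot(x, grad) ** 2)

x = np.array([1.0, 2.0, 2.0])
collinear = alignment_penalty(x, 2.0 * x)                           # gradient parallel to x
orthogonal = alignment_penalty(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

The collinear case gives zero penalty, while the orthogonal case gives the full product of squared norms, here 9.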

## Acknowledgements

CE and PM acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG) - Projektnummer 281474342: ’RTG - Parameter Identification - Analysis, Algorithms, Applications’. The work by SL was supported by the EPSRC grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis and by the Cantab Capital Institute for the Mathematics of Information. CBS acknowledges support from the Leverhulme Trust projects on Breaking the non-convexity barrier and on Unveiling the Invisible, the Philip Leverhulme Prize, the EPSRC grant Nr. EP/M00483X/1, the EPSRC Centre Nr. EP/N014588/1, the European Union Horizon 2020 research and innovation programmes under the Marie Skłodowska-Curie grant agreement No 777826 NoMADS and No 691070 CHiPS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Quadro P6000 and a Titan Xp GPU used for this research.

## Appendix

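##### Proof of Equation (2):

Note that

$$F(x+e) \neq F(x) \;\Leftrightarrow\; \langle x+e, z \rangle \langle x, z \rangle < 0 \;\Leftrightarrow\; -\operatorname{sgn}(\langle x, z \rangle) \, \langle e, z \rangle > |\langle x, z \rangle|.$$

For fixed $\|e\|$, the left-hand side is clearly maximized for $e = -\operatorname{sgn}(\langle x, z \rangle) \, \|e\| \, z / \|z\|$, leading to

$$\|e\| \, \|z\| > |\langle x, z \rangle|.$$

This proves the claim by taking the infimum over $e$.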

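###### Proof of Lemma 1.

By the assumption of the lemma, we can take the infimum in (1) over all perturbations in the local affine component only, i.e. over those $e$ for which $x + e$ lies in the locally affine region around $x$. This allows us to reformulate

$$\begin{aligned}
F(x+e) \neq F(x) \;&\Leftrightarrow\; \exists j \neq i^*: \Psi_j(x+e) > \Psi_{i^*}(x+e) \\
&\Leftrightarrow\; \exists j \neq i^*: \langle \nabla\Psi_j(x) - \nabla\Psi_{i^*}(x), e \rangle > \Psi_{i^*}(x) - \Psi_j(x).
\end{aligned}$$

The infimum over $e$ is achieved by choosing $e$ as a multiple of $\nabla\Psi_j(x) - \nabla\Psi_{i^*}(x)$. A direct computation then finishes the proof. ∎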

### Proofs of Homogenization results

###### Lemma 3 (Euler’s Homogeneous Function Theorem).

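Let $f$ be a positive one-homogeneous function that is continuously differentiable on $X \setminus \{0\}$. Then

$$f(x) = \langle \nabla f(x), x \rangle.$$

###### Proof.

First note that, for $a > 0$,

$$\partial_i f(ax) = \lim_{t \to 0} \frac{f(ax + t e_i) - f(ax)}{t} = \lim_{t \to 0} \frac{f(ax + a t e_i) - f(ax)}{a t} = \partial_i f(x).$$

Hence

$$f(x) = \int_0^1 \langle \nabla f(tx), x \rangle \, dt = \langle \nabla f(x), x \rangle.$$

∎

###### Proof of Lemma 2.

Direct consequence of Lemma 3. ∎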

###### Definition 5 (Neural Networks).

Define the class of neural networks to be any network built on learnable affine transforms (convolutional layers, dense layers) with linear weights and biases and ReLU or leaky ReLU activations. The network can include arbitrary skip-connections, batch-normalization layers and max or average pooling layers of arbitrary window size. This in particular includes many state-of-the-art classification networks.

###### Lemma 4 (Homogeneous Networks).

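For fixed $x \neq 0$, consider the logit $\Psi^i_{\Theta,b}$ of a network of the class given in Definition 5, where $\Theta$ denotes the linear weights and $b$ the bias vector of the network. Then the function

$$f: y \mapsto \Psi^i_{\Theta, \, b \|y\| / \|x\|}(y)$$

is positive one-homogeneous and $f(x) = \Psi^i_{\Theta,b}(x)$.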

###### Proof.

Consider first a network consisting of a single layer with linear transform $A$ and bias $b$ with ReLU non-linearity. The associated network function is hence given by $y \mapsto (Ay + b)_+$. For this network, we compute for fixed $x$ and any $y$ and $a > 0$

$$f(ay) = \left( A(ay) + b \, \frac{\|ay\|}{\|x\|} \right)_+ = \left( a \cdot Ay + a \cdot b \, \frac{\|y\|}{\|x\|} \right)_+ = a f(y).$$

A single layer is hence positive one-homogeneous. Since a composition of positive one-homogeneous functions is positive one-homogeneous itself, the function associated to a network consisting of affine transforms and ReLU activations is positive one-homogeneous as well. Skip-connections, batch-normalization layers and max or average pooling are all positive one-homogeneous operations too, thus proving the claim. ∎
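The rescaling trick of Lemma 4 can be checked numerically. The sketch below assumes a small two-layer ReLU network with random weights and a single output logit; the `scale` argument, which multiplies every bias, and all other names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 5, 7
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
w2, b2 = rng.normal(size=h), rng.normal()
x = rng.normal(size=d)      # fixed reference point of the lemma

def logit(y, scale=1.0):
    """Single logit of a two-layer ReLU network with all biases multiplied by `scale`."""
    return w2 @ np.maximum(W1 @ y + scale * b1, 0.0) + scale * b2

def f(y):
    """Homogenized logit: biases are rescaled by ||y|| / ||x||."""
    return logit(y, scale=np.linalg.norm(y) / np.linalg.norm(x))

y = rng.normal(size=d)
a = 3.7
lhs, rhs = f(a * y), a * f(y)
```

The values `lhs` and `rhs` agree, and `f(x)` equals the original logit at the reference point because the bias scale is one there.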

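###### Proof of Theorem 1.

Let $f$ be the function associated with the network as in Lemma 4. Then by Lemma 3 we can compute the value of $f$ at the point $x$ via

$$f(x) = \langle x, \nabla_y f(y)|_{y=x} \rangle.$$

Note that by construction $f(x) = \Psi^i_{\Theta,b}(x)$. We compute the gradient of $f$ at the point $x$ explicitly as

$$\nabla_y f(y)|_{y=x} = \nabla_x \Psi^i_{\Theta,b}(x) + \frac{x}{\|x\|^2} \langle b, \nabla_b \Psi^i_{\Theta,b}(x) \rangle.$$

Combining these results shows

$$f(x) = \left\langle x, \nabla_x \Psi^i_{\Theta,b}(x) + \frac{x}{\|x\|^2} \langle b, \nabla_b \Psi^i_{\Theta,b}(x) \rangle \right\rangle = \langle x, \nabla_x \Psi^i_{\Theta,b}(x) \rangle + \langle b, \nabla_b \Psi^i_{\Theta,b}(x) \rangle.$$

∎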

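Recall the notation $i^* = F(x)$ and $j^*$ for the minimizer in $j$ in (9).

###### Proof of Theorem 2.

We have

$$\begin{aligned}
\tilde\rho(x) &= \frac{\Psi_{i^*}(x) - \Psi_{j^*}(x)}{\|\nabla\Psi_{i^*}(x) - \nabla\Psi_{j^*}(x)\|} \\
&= \frac{\langle x, \nabla\Psi_{i^*}(x) - \nabla\Psi_{j^*}(x) \rangle + \beta_{i^*}(x) - \beta_{j^*}(x)}{\|\nabla\Psi_{i^*}(x) - \nabla\Psi_{j^*}(x)\|} \\
&= \left| \langle x, \bar g^\dagger \rangle + \frac{\beta^\dagger}{\|g^\dagger\|} \right| \le \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|},
\end{aligned}$$

using the decomposition theorem and the triangle inequality. Further,

$$\begin{aligned}
\alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|} &= |\langle x, \bar g^\dagger \rangle| + \frac{|\beta^\dagger|}{\|g^\dagger\|} \\
&= |\langle x, \bar g^\dagger - \bar g + \bar g \rangle| + \frac{|\beta^\dagger|}{\|g^\dagger\|} \\
&\le |\langle x, \bar g \rangle| + |\langle x, \bar g^\dagger - \bar g \rangle| + \frac{|\beta^\dagger|}{\|g^\dagger\|} \\
&\le \alpha(x) + \|x\| \cdot \|\bar g^\dagger - \bar g\| + \frac{|\beta^\dagger|}{\|g^\dagger\|},
\end{aligned}$$

using the Cauchy-Schwarz inequality. ∎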

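###### Proof of Theorem 3.

We have

$$\begin{aligned}
\tilde\rho(x) &= \frac{\langle x, g^\dagger \rangle + \beta^\dagger \left\langle \frac{g^\dagger}{\|g^\dagger\|^2}, g^\dagger \right\rangle}{\|g^\dagger\|} = \frac{\left\langle x + \frac{\beta^\dagger}{\|g^\dagger\|} \frac{g^\dagger}{\|g^\dagger\|}, g^\dagger \right\rangle}{\|g^\dagger\|} \\
&= \langle \xi, \bar g^\dagger \rangle = \langle \xi, \bar g^\dagger - \bar\gamma + \bar\gamma \rangle \\
&\le |\langle \xi, \bar\gamma \rangle| + \|\xi\| \cdot \|\bar g^\dagger - \bar\gamma\|,
\end{aligned}$$

using the Cauchy-Schwarz inequality in the same way as in the last theorem. ∎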

### MNIST Model Architecture

Here we describe the architecture that was used for the MNIST models.

Conv2D (, ’same’), 32 feature maps, ReLU
Max Pooling (factor 2)
Conv2D (, ’same’), 64 feature maps, ReLU
Max Pooling (factor 2)
Conv2D (, ’same’), 128 feature maps, ReLU
Max Pooling (factor 2)

Dense Layer (128 neurons), ReLU

Dropout ()
Softmax