# Adversarial Robustness through Local Linearization

Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and the number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under stronger attacks. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed the current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. Additionally, we match state of the art results for CIFAR-10 at 8/255.


## 1 Introduction

In a seminal paper, Szegedy et al. [23] demonstrated that neural networks are vulnerable to visually imperceptible but carefully chosen adversarial perturbations which cause neural networks to output incorrect predictions. After this revealing study, a flurry of research has been conducted with the focus of making networks robust against such adversarial perturbations [14, 16, 19, 26]. Concurrently, researchers devised stronger attacks that expose previously unknown vulnerabilities of neural networks [25, 4, 1, 3].

Of the many approaches proposed [20, 2, 6, 22, 15, 19], adversarial training [14, 16] is empirically the best performing algorithm to train networks robust to adversarial perturbations. However, the cost of adversarial training becomes prohibitive with growing model complexity and input dimensionality. This is primarily due to the cost of computing adversarial perturbations, which is incurred at each step of adversarial training. In particular, for each new mini-batch one must perform multiple iterations of a gradient-based optimizer on the network’s inputs to find said perturbations. (While computing the globally optimal adversarial example is NP-hard [11], gradient descent with several random restarts was empirically shown to be quite effective at computing adversarial perturbations of sufficient quality.) As each step of this optimizer requires a new backwards pass, the total cost of adversarial training scales roughly linearly with the number of such steps. Unfortunately, effective adversarial training of ImageNet often requires a large number of steps to avoid problems of gradient obfuscation [1, 25], making it much more expensive than conventional training, almost prohibitively so.

One approach which can alleviate the cost of adversarial training is training against weaker adversaries that are cheaper to compute, for example by taking fewer gradient steps to compute adversarial examples during training. However, this can produce models which are robust against weak attacks but break down under strong attacks, often due to gradient obfuscation. In particular, one form of gradient obfuscation occurs when the network learns to fool a gradient-based attack by making the loss surface highly convoluted and non-linear (see Fig 1), which in turn prevents gradient-based optimization methods from finding an adversarial perturbation within a small number of iterations [4, 25]. In contrast, if the loss surface were linear in the vicinity of the training examples, which is to say well-predicted by local gradient information, gradient obfuscation could not occur. In this paper, we take up this idea and introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data. We call this regularizer the local linearity regularizer (LLR). Empirically, we find that networks trained with LLR exhibit far less gradient obfuscation, and are almost equally robust against strong attacks as they are against weak attacks.

The main contributions of our paper are summarized below:

• We show that training with LLR is significantly faster than adversarial training, allowing us to train a robust ImageNet model with a roughly 5× speed up (7 hours versus 36 hours of wall time) when training on 128 TPUv3 cores [9].

• We show that LLR trained models exhibit higher robustness relative to adversarially trained models when evaluated under strong attacks. Adversarially trained models can exhibit a decrease in accuracy of 6% when increasing the attack strength at test time for CIFAR-10, whereas LLR shows only a decrease of 2%.

• We achieve new state of the art results for adversarial accuracy against an untargeted white-box attack for ImageNet (with $\epsilon = 4/255$, meaning that every pixel is perturbed independently by up to 4 units up or down on a scale where pixels take values ranging between 0 and 255): 47%. Furthermore, we match state of the art results for CIFAR-10 (with $\epsilon = 8/255$). We note that TRADES [28] gets 55% against a much weaker attack; under our strongest attack, it gets 52.5%.

• We perform a large scale evaluation of existing methods for adversarially robust training under consistent, strong, white-box attacks. For this we recreate several baseline models from the literature, training them both for CIFAR-10 and ImageNet (where possible). The baselines created are adversarial training, TRADES [28] and CURE [19]. In contrast to CIFAR-10, we are currently unable to achieve consistent and competitive results on ImageNet using TRADES.

## 2 Background and Related Work

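We denote our classification function by $f(x;\theta): x \mapsto \mathbb{R}^{C}$, mapping input features $x$ to the output logits for classes in set $C$, i.e. $p_i(y|x;\theta) = \exp(f_i(x;\theta)) / \sum_j \exp(f_j(x;\theta))$, with $\theta$ being the model parameters and $y$ being the label. Adversarial robustness for $f$ is defined as follows: a network is robust to adversarial perturbations of magnitude $\epsilon$ at input $x$ if and only if

$$\operatorname*{argmax}_{i\in C} f_i(x;\theta) = \operatorname*{argmax}_{i\in C} f_i(x+\delta;\theta) \quad \forall \delta \in B_p(\epsilon) = \{\delta : \|\delta\|_p \le \epsilon\}. \tag{1}$$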

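In this paper, we focus on $p=\infty$ and we use $B(\epsilon)$ to denote $B_\infty(\epsilon)$ for brevity. Given that the dataset is drawn from distribution $\mathcal{D}$, the standard method to train a classifier $f$ is empirical risk minimization (ERM), which is defined by $\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(x;y,\theta)]$. Here, $\ell(x;y,\theta)$ is the standard cross-entropy loss function defined by

$$\ell(x;y,\theta) = -y^T \log(p(x;\theta)), \tag{2}$$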

where $p_i(x;\theta)$ is defined as above, and $y$ is a 1-hot vector representing the class label. While ERM is effective at training neural networks that perform well on holdout test data, the accuracy on the test set goes to zero under adversarial evaluation. This is a result of a distribution shift in the data induced by the attack. To rectify this, adversarial training [19, 14] seeks to perturb the data distribution by performing adversarial attacks during training. More concretely, adversarial training minimizes the loss function

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\delta\in B(\epsilon)} \ell(x+\delta;y,\theta)\right], \tag{3}$$

where the inner maximization, $\max_{\delta\in B(\epsilon)} \ell(x+\delta;y,\theta)$, is typically performed via a fixed number of steps of a gradient-based optimization method. One such method is Projected-Gradient-Descent (PGD) which performs the following gradient step:

$$\delta \leftarrow \mathrm{Proj}\left(\delta - \eta\nabla_\delta \ell(x+\delta;y,\theta)\right), \tag{4}$$

where $\mathrm{Proj}(x) = \operatorname*{argmin}_{\xi\in B(\epsilon)} \|x-\xi\|$. Another popular gradient-based method is to use the sign of the gradient [8]. The cost of solving Eq (3) is dominated by the cost of solving the inner maximization problem. Thus, the inner maximization should be performed efficiently to reduce the overall cost of training. A naive approach is to reduce the number of gradient steps performed by the optimization procedure. Generally, the attack is weaker when we do fewer steps. If the attack is too weak, the trained networks often display gradient obfuscation as highlighted in Fig 1.
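The inner loop described above can be sketched on a toy problem. The block below is a minimal illustration, not the authors' implementation: the linear logistic classifier, the names `loss`, `grad_x`, `project_linf` and `pgd_attack`, and the use of plain gradient steps are all assumptions made for the example. Because the inner problem is a maximization, the sketch takes steps that increase the loss and then clips the perturbation back into the l-infinity ball; flip the sign of the step if the attack objective is defined as a quantity to minimize.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(x, y, w):
    """Toy logistic loss log(1 + exp(-y * <w, x>)) with label y in {-1, +1}."""
    m = y * dot(w, x)
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))  # stable softplus(-m)


def grad_x(x, y, w):
    """Gradient of the toy loss with respect to the input x."""
    m = y * dot(w, x)
    s = 1.0 / (1.0 + math.exp(m))  # sigmoid(-m)
    return [-y * s * wi for wi in w]


def project_linf(delta, eps):
    """Projection onto the l-infinity ball B(eps): clip every coordinate."""
    return [min(max(d, -eps), eps) for d in delta]


def pgd_attack(x, y, w, eps, step_size, num_steps):
    """Run num_steps projected gradient steps that increase the loss."""
    delta = [0.0] * len(x)
    for _ in range(num_steps):
        g = grad_x([xi + di for xi, di in zip(x, delta)], y, w)
        delta = project_linf(
            [di + step_size * gi for di, gi in zip(delta, g)], eps
        )
    return delta
```

Each iteration needs one gradient evaluation at the perturbed input, which is why the cost of the inner maximization grows with `num_steps`.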

Since the invention of adversarial training, a corpus of work has researched alternative ways of making networks robust. One such approach is the TRADES method [28], which is a form of regularization that specifically optimizes the trade-off between robustness and accuracy, as many studies have observed these two quantities to be at odds with each other [24]. Others, such as work by Ding et al. [7], adaptively increase the perturbation radius by finding the minimal length perturbation which changes the output label. Some have proposed architectural changes which promote adversarial robustness, such as the "denoise" model [26] for ImageNet.

The work presented in this paper is closely related to the paper by Moosavi et al. [19], which highlights that adversarial training reduces the curvature of $\ell(x;y,\theta)$ with respect to $x$. Leveraging an empirical observation (the highest curvature is along the direction $\nabla_x\ell(x;y,\theta)$), they further propose an algorithm to mimic the effects of adversarial training on the loss surface. The algorithm results in comparable performance to adversarial training with a significantly lower cost.

## 3 Motivating the Local Linearity Regularizer

As described above, the cost of adversarial training is dominated by solving the inner maximization problem $\max_{\delta\in B(\epsilon)}\ell(x+\delta)$. Throughout we abbreviate $\ell(x;y,\theta)$ with $\ell(x)$. We can reduce this cost simply by reducing the number of PGD steps (as defined in Eq (4)) taken to solve $\max_{\delta\in B(\epsilon)}\ell(x+\delta)$. To motivate the local linearity regularizer (LLR), we start with an empirical analysis of how the behavior of adversarial training changes as we increase the number of PGD steps used during training. We find that the loss surface becomes increasingly linear as we increase the number of PGD steps, captured by the local linearity measure defined below.

### 3.1 Local Linearity Measure

Suppose that we are given an adversarial perturbation $\delta\in B(\epsilon)$. The corresponding adversarial loss is given by $\ell(x+\delta)$. If our loss surface is smooth and approximately linear, then $\ell(x+\delta)$ is well approximated by its first-order Taylor expansion $\ell(x)+\delta^T\nabla_x\ell(x)$. In other words, the absolute difference between these two values,

$$g(\delta;x) = \left|\ell(x+\delta)-\ell(x)-\delta^T\nabla_x\ell(x)\right|, \tag{5}$$

is an indicator of how linear the surface is. Consequently, we consider the quantity

$$\gamma(\epsilon,x) = \max_{\delta\in B(\epsilon)} \left|\ell(x+\delta)-\ell(x)-\delta^T\nabla_x\ell(x)\right|, \tag{6}$$

to be a measure of how linear the surface is within a neighbourhood $B(\epsilon)$. We call this quantity the local linearity measure.
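To make Eqs (5) and (6) concrete, the following sketch evaluates the linearity gap on a hypothetical toy loss (softplus of a linear score) and estimates the local linearity measure by uniform random sampling inside the ball. Both the toy loss and the sampling strategy are assumptions for illustration; sampling only gives a lower bound on the true maximum, whereas the measurements reported in this paper use PGD to search for the maximizer.

```python
import math
import random


def loss(x, w):
    """Hypothetical smooth toy loss: softplus of a linear score <w, x>."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def grad(x, w):
    """Gradient of the toy loss with respect to x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    s = 1.0 / (1.0 + math.exp(-z))
    return [s * wi for wi in w]


def linearity_gap(delta, x, w):
    """g(delta; x) = |l(x + delta) - l(x) - delta^T grad l(x)| from Eq (5)."""
    first_order = loss(x, w) + sum(d * g for d, g in zip(delta, grad(x, w)))
    shifted = [xi + di for xi, di in zip(x, delta)]
    return abs(loss(shifted, w) - first_order)


def estimate_gamma(x, w, eps, num_samples=2000, seed=0):
    """Lower-bound gamma(eps, x) of Eq (6) by sampling delta in B(eps)."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(num_samples):
        delta = [rng.uniform(-eps, eps) for _ in x]
        best = max(best, linearity_gap(delta, x, w))
    return best
```

For a loss that is exactly linear in `x` the gap is zero for every perturbation, so larger estimates indicate stronger curvature inside the ball.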

### 3.2 Empirical Observations on Adversarial Training

We measure $\gamma(\epsilon,x)$ for networks trained with adversarial training on CIFAR-10, where the inner maximization is performed with 1, 2, 4, 8 and 16 steps of PGD. $\gamma(\epsilon,x)$ is measured throughout training on the training set. (To measure $\gamma(\epsilon,x)$ we find $\max_{\delta\in B(\epsilon)} g(\delta;x)$ with 50 steps of PGD using Adam as the optimizer and 0.1 as the step size.) The architecture used is a wide residual network [27] 28 in depth and 10 in width (Wide-ResNet-28-10). The results are shown in Fig 1(a) and 1(b). Fig 1(a) shows that when we train with one and two steps of PGD for the inner maximization, the local loss surface is extremely non-linear. An example visualization of such a loss surface is given in Fig 4(a). However, when we train with four or more steps of PGD for the inner maximization, the surface is relatively well approximated by $\ell(x)+\delta^T\nabla_x\ell(x)$ as shown in Fig 1(b). An example of the loss surface is shown in Fig 4(b). For the adversarial accuracy of the networks, see Table 4.

## 4 Local Linearity Regularizer (LLR)

From the section above, we make the empirical observation that the local linearity measure $\gamma(\epsilon,x)$ decreases as we train with stronger attacks (here, stronger means an increase in the number of PGD steps for the inner maximization $\max_{\delta\in B(\epsilon)}\ell(x+\delta)$). In this section, we give some theoretical justifications of why local linearity $\gamma(\epsilon,x)$ correlates with adversarial robustness, and derive a regularizer from the local linearity measure that can be used for training of robust models.

### 4.1 Local Linearity Upper Bounds Adversarial Loss

The following proposition establishes that the adversarial loss $\ell(x+\delta)$ is upper bounded by the local linearity measure, plus the change in loss as predicted by the gradient (which is given by $|\delta^T\nabla_x\ell(x)|$).

###### Proposition 4.1.

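Consider a loss function $\ell(x)$ that is once-differentiable, and a local neighbourhood defined by $B(\epsilon)=\{\delta:\|\delta\|\le\epsilon\}$. Then for all $\delta\in B(\epsilon)$

$$|\ell(x+\delta)-\ell(x)| \le |\delta^T\nabla_x\ell(x)| + \gamma(\epsilon,x). \tag{7}$$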

See Appendix B for the proof.

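From Eq (7) it is clear that the adversarial loss tends to $\ell(x)$, i.e., $\ell(x+\delta)\to\ell(x)$, as both $|\delta^T\nabla_x\ell(x)|\to 0$ and $\gamma(\epsilon;x)\to 0$ for all $\delta\in B(\epsilon)$. And assuming $\ell(x+\delta)\ge\ell(x)$, one also has the upper bound $\ell(x+\delta)\le\ell(x)+|\delta^T\nabla_x\ell(x)|+\gamma(\epsilon,x)$.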
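The bound in Eq (7) can be checked numerically on a small example. The sketch below uses a hypothetical one-dimensional, non-convex loss and a dense grid over the interval of allowed perturbations; the loss, the grid resolution and the function names are assumptions chosen only to keep the check short. It returns the grid estimate of the local linearity measure together with the smallest slack between the two sides of Eq (7).

```python
import math


def loss(x):
    """Hypothetical 1-D non-convex loss used only for illustration."""
    return math.sin(3.0 * x) + 0.5 * x * x


def grad(x):
    """Derivative of the toy loss."""
    return 3.0 * math.cos(3.0 * x) + x


def gap(delta, x):
    """g(delta; x) from Eq (5)."""
    return abs(loss(x + delta) - loss(x) - delta * grad(x))


def check_bound(x, eps, num_points=401):
    """Return (gamma, worst_slack) over a grid of perturbations in [-eps, eps]."""
    grid = [-eps + 2.0 * eps * k / (num_points - 1) for k in range(num_points)]
    gamma = max(gap(d, x) for d in grid)  # grid estimate of Eq (6)
    # Slack of Eq (7): right-hand side minus left-hand side, per perturbation.
    slack = min(
        abs(d * grad(x)) + gamma - abs(loss(x + d) - loss(x)) for d in grid
    )
    return gamma, slack
```

A non-negative `worst_slack` (up to rounding) means the inequality holds at every grid point, which is what the triangle-inequality argument in Appendix B predicts.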

### 4.2 Local Linearity Regularization (LLR)

Following the analysis above, we propose the following objective for adversarially robust training

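$$L(\mathcal{D}) = \mathbb{E}_{\mathcal{D}}\Big[\ell(x) + \underbrace{\mu\,|\delta_{LLR}^T\nabla_x\ell(x)| + \lambda\,\gamma(\epsilon,x)}_{\text{LLR}}\Big], \tag{8}$$

where $\lambda$ and $\mu$ are hyper-parameters to be optimized, and $\delta_{LLR} = \operatorname*{argmax}_{\delta\in B(\epsilon)} g(\delta;x)$ (recall the definition of $g(\delta;x)$ from Eq (5)). Concretely, we are trying to find the point $\delta_{LLR}$ in $B(\epsilon)$ where the linear approximation $\ell(x)+\delta^T\nabla_x\ell(x)$ is maximally violated. To train we penalize both its linear violation $\gamma(\epsilon,x) = |\ell(x+\delta_{LLR})-\ell(x)-\delta_{LLR}^T\nabla_x\ell(x)|$ and the gradient magnitude term $|\delta_{LLR}^T\nabla_x\ell(x)|$, as required by the above proposition. We note that, analogous to adversarial training, LLR requires an inner optimization to find $\delta_{LLR}$, performed via gradient descent. However, as we will show in the experiments, far fewer optimization steps are required for the overall scheme to be effective. Pseudo-code for training with this regularizer is given in Appendix E.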
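A rough sketch of how the per-example objective in Eq (8) could be evaluated is given below. It is not the pseudo-code from Appendix E: the toy loss, the random starting point, the plain (non-Adam) projected gradient steps and the name `llr_objective` are assumptions made for illustration. The inner loop searches for the perturbation that maximizes the linearity gap, and the returned value combines the nominal loss with the two penalty terms.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(x, w):
    """Hypothetical smooth toy loss: softplus of a linear score <w, x>."""
    z = dot(w, x)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def grad(x, w):
    """Gradient of the toy loss with respect to x."""
    s = 1.0 / (1.0 + math.exp(-dot(w, x)))
    return [s * wi for wi in w]


def llr_objective(x, w, eps, lam, mu, num_steps=2, step_size=0.1, seed=0):
    """Evaluate l(x) + mu * |delta^T grad l(x)| + lam * g(delta; x) at an
    approximate maximizer delta of g over the l-infinity ball B(eps)."""
    rng = random.Random(seed)
    l0, g0 = loss(x, w), grad(x, w)
    # Random start (an assumption): the gradient of g vanishes at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(num_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        violation = loss(x_adv, w) - l0 - dot(delta, g0)
        sign = 1.0 if violation >= 0.0 else -1.0
        # d g / d delta = sign * (grad l(x + delta) - grad l(x))
        g_delta = [sign * (ga - gb) for ga, gb in zip(grad(x_adv, w), g0)]
        delta = [
            min(max(d + step_size * gd, -eps), eps)
            for d, gd in zip(delta, g_delta)
        ]
    x_adv = [xi + di for xi, di in zip(x, delta)]
    gamma = abs(loss(x_adv, w) - l0 - dot(delta, g0))
    return l0 + mu * abs(dot(delta, g0)) + lam * gamma
```

In actual training this quantity would be differentiated with respect to the model parameters; the sketch only evaluates it for fixed parameters `w`.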

### 4.3 Local Linearity γ(ϵ;x) is a sufficient regularizer by itself

Interestingly, under certain reasonable approximations and standard choices of loss functions, we can bound $|\delta^T\nabla_x\ell(x)|$ in terms of $\gamma(\epsilon;x)$. See Appendix C for details. Consequently, the bound in Eq (7) implies that minimizing $\gamma(\epsilon;x)$ (along with the nominal loss $\ell(x)$) is sufficient to minimize the adversarial loss $\ell(x+\delta)$. This prediction is confirmed by our experiments. However, our experiments also show that including $|\delta^T\nabla_x\ell(x)|$ in the objective along with $\ell(x)$ and $\gamma(\epsilon;x)$ works better in practice on certain datasets, especially ImageNet. See Appendix F.3 for details.

## 5 Experiments and Results

We perform experiments using LLR on both CIFAR-10 [13] and ImageNet [5] datasets. We show that LLR gets state of the art adversarial accuracy on CIFAR-10 (at $\epsilon = 8/255$) and ImageNet (at $\epsilon = 4/255$) evaluated under a strong adversarial attack. Moreover, we show that as the attack strength increases, the degradation in adversarial accuracy is more graceful for networks trained using LLR than for those trained with standard adversarial training. Further, we demonstrate that training using LLR is faster for ImageNet. Finally, we show that, by linearizing the loss surface, models are less prone to gradient obfuscation.

CIFAR-10: The perturbation radius we examine is $\epsilon = 8/255$ and the model architectures we use are Wide-ResNet-28-8 and Wide-ResNet-40-8 [27]. Since the validity of our regularizer requires $\ell(x)$ to be smooth, the activation function we use is the softplus function $\log(1+\exp(x))$, which is a smooth version of ReLU. The baselines we compare our results against are adversarial training (ADV) [16], TRADES [28] and CURE [19]. We recreate these baselines from the literature using the same network architecture and activation function. The evaluation is done on the full test set of 10K images.
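The smoothness argument for softplus can be illustrated in a few lines. The helper names below are hypothetical; the point of the sketch is that the derivative of softplus is the logistic sigmoid, which is continuous at zero, whereas the derivative of ReLU jumps there, and that the two activations agree closely away from the origin.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x):
    """Derivative of softplus: the logistic sigmoid, continuous everywhere."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """ReLU for comparison; its derivative jumps from 0 to 1 at x = 0."""
    return max(x, 0.0)
```

For example, `softplus(0.0)` equals log 2 and `softplus_grad(0.0)` equals 0.5, while for inputs of magnitude 20 the difference between softplus and ReLU is already below 1e-8.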

ImageNet: The perturbation radii considered are $\epsilon = 4/255$ and $\epsilon = 16/255$. The architecture used for this is ResNet-152 from [10]. We use softplus as the activation function. For $\epsilon = 4/255$, the baselines we compare our results against are our recreated versions of ADV [16] and the denoising model (DENOISE) [26]. (We attempted to use TRADES on ImageNet but did not manage to get competitive results. Thus they are omitted from the baselines.) For $\epsilon = 16/255$, we compare LLR to ADV [16] and DENOISE [26] networks which have been published in the literature. Due to computational constraints, we limit ourselves to evaluating all models on the first 1K images of the test set.

To make sure that we have a close estimate of the true robustness, we evaluate all the models on a wide range of attacks; these are described below.

### 5.1 Evaluation Setup

To accurately gauge the true robustness of our network, we tailor our attack to give the lowest possible adversarial accuracy. The two parts which we tune to get the optimal attack are the loss function for the attack and its corresponding optimization procedure. The loss functions used are described below; for the optimization procedure please refer to Appendix F.1.

Loss Functions: The three loss functions we consider are summarized in Table 1. We use the difference between logits for the loss function rather than the cross-entropy loss as we have empirically found the former to yield lower adversarial accuracy.

### 5.2 Results for Robustness

For CIFAR-10, the main adversarial accuracy results are given in Table 2. We compare LLR training to ADV [16], CURE [19] and TRADES [28], both with our re-implementation and the published models. (Note that the network published for TRADES [28] uses a Wide-ResNet-34-10, so this is not shown in the table, but under the same rigorous evaluation we show that TRADES gets 84.91% nominal accuracy, 53.41% under Untargeted and 52.58% under Multi-Targeted.) Note that our re-implementations using softplus activations perform at or above the published results for ADV, CURE and TRADES. This is largely due to the learning rate schedule used, which is similar to the one used by TRADES [28].

Interestingly, for adversarial training (ADV), using the Multi-Targeted attack for evaluation gives significantly lower adversarial accuracy compared to Untargeted (see Table 2). Evaluation using the Multi-Targeted attack consistently gave the lowest adversarial accuracy throughout. Under this attack, the methods which stand out amongst the rest are LLR and TRADES. Using LLR we get state of the art adversarial accuracy.

For ImageNet, we compare against adversarial training (ADV) [16] and the denoising model (DENOISE) [26]. The results are shown in Table 3. For a perturbation radius of 4/255, LLR gets 47% adversarial accuracy under the Untargeted attack, which is notably higher than the 39.70% adversarial accuracy obtained via adversarial training. Moreover, LLR is trained with just two steps of PGD rather than 30 steps for adversarial training. The amount of computation needed for each method is further discussed in Sec 5.2.1.

Further shown in Table 3 are the results for $\epsilon = 16/255$. We note a significant drop in nominal accuracy when we train with LLR to perturbation radius 16/255. When testing for perturbation radius of 16/255 we also show that the adversarial accuracy under Untargeted is very poor (below 8%) for all methods. We speculate that this perturbation radius is too large for the robustness problem: adversarial perturbations should be, by definition, imperceptible to the human eye, but upon inspection of the images generated using an adversarial attack (see Fig 8), this assumption no longer holds true. The images generated appear to consist of super-imposed object parts of other classes onto the target image. This leads us to believe that a more fine-grained analysis of what should constitute "robustness for ImageNet" is an important topic for debate.

#### 5.2.1 Runtime Speed

For ImageNet, we trained on 128 TPUv3 cores [9]. The total training wall time for the LLR network (4/255) is 7 hours for 110 epochs. Similarly, for the adversarially trained (ADV) networks the total wall time is 36 hours for 110 epochs. This is a roughly 5× speed up.

#### 5.2.2 Accuracy Degradation: Strong vs Weak Evaluation

The resulting model trained using LLR degrades gracefully in terms of adversarial accuracy when we increase the strength of attack, as shown in Fig 3. In particular, Fig 2(a) shows that, for CIFAR-10, when the attack changes from Untargeted to Multi-Targeted, the LLR network's accuracy remains similar, with only a drop of about 2%. In contrast, for adversarial training (ADV) we see a drop of about 6%. We also see similar trends in accuracy in Table 2. This could indicate that some level of obfuscation may be happening under standard adversarial training.

As we empirically observe that LLR evaluates similarly under weak and strong attacks, we hypothesize that this is because LLR explicitly linearizes the loss surface. An extreme case would be when the surface is completely linear - in this instance the optimal adversarial perturbation would be found with just one PGD step. Thus evaluation using a weak attack is often good enough to get an accurate gauge of how it will perform under a stronger attack.

For ImageNet (see Fig 2(b)), the adversarial accuracy of the network trained using LLR remains significantly higher (by 7.5%) than that of the adversarially trained network when going from a weak to a stronger attack.

### 5.3 Resistance to Gradient Obfuscation

We use either the standard adversarial training objective (ADV-1, ADV-2) or the LLR objective (LLR-1, LLR-2), taking one or two steps of PGD to maximize each objective. To train LLR-1/2, we only optimize the local linearity $\gamma(\epsilon;x)$, i.e. $\mu$ in Eq. (8) is set to zero. We see that for adversarial training, as shown in Figs 3(a) and 3(c), the loss surface becomes highly non-linear and jagged, in other words obfuscated. Additionally in this setting, the adversarial accuracy under our strongest attack is 0% for both, see Table 6. In contrast, the loss surface is smooth when we train using LLR as shown in Figs 3(b) and 3(d). Further, Table 6 shows that we obtain an adversarial accuracy of 44.50% with the LLR-2 network under our strongest evaluation. We also evaluate the values of $\gamma(\epsilon;x)$ for the CIFAR-10 test set after these networks are trained. This is shown in Fig 7. The values of $\gamma(\epsilon;x)$ obtained when we train with LLR using two steps of PGD are comparable to those from adversarial training with 20 steps of PGD. By comparison, adversarial training with two steps of PGD results in much larger values of $\gamma(\epsilon;x)$.

## 6 Conclusions

We show that, by promoting linearity, deep classification networks are less susceptible to gradient obfuscation, thus allowing us to do fewer gradient descent steps for the inner optimization. Our novel linearity regularizer promotes locally linear behavior as justified from a theoretical perspective. The resulting models achieve state of the art adversarial robustness on the CIFAR-10 and ImageNet datasets, and can be trained faster than regular adversarial training.

## Appendix B Local Linearity Upper Bounds Robustness: Proof of Proposition 4.1

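Proposition 4.1. Consider a loss function $\ell(x)$ that is once-differentiable, and a local neighbourhood defined by $B(\epsilon)=\{\delta:\|\delta\|\le\epsilon\}$. Then for all $\delta\in B(\epsilon)$

$$|\ell(x+\delta)-\ell(x)| \le |\delta^T\nabla_x\ell(x)| + \gamma(\epsilon,x).$$

###### Proof.

Firstly we note that $|\ell(x+\delta)-\ell(x)|$ can be rewritten as the following:

$$|\ell(x+\delta)-\ell(x)| = \left|\delta^T\nabla_x\ell(x) + \ell(x+\delta)-\ell(x)-\delta^T\nabla_x\ell(x)\right|.$$

Thus, by the triangle inequality, we can form the following bound:

$$|\ell(x+\delta)-\ell(x)| \le \left|\delta^T\nabla_x\ell(x)\right| + g(\delta;x),$$

where $g(\delta;x) = |\ell(x+\delta)-\ell(x)-\delta^T\nabla_x\ell(x)|$. We note that since $\gamma(\epsilon,x) = \max_{\delta\in B(\epsilon)} g(\delta;x)$, for all $\delta\in B(\epsilon)$

$$|\ell(x+\delta)-\ell(x)| \le \left|\delta^T\nabla_x\ell(x)\right| + \gamma(\epsilon,x).$$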

## Appendix C Local Linearity γ(ϵ,x) is a sufficient regularizer by itself

### C.1 A local quadratic model of the loss

The starting point for proving our bounds will be the following local quadratic approximation of the loss:

$$\ell(x+\delta) = \ell(x) + \delta^\top\nabla_x\ell(x) + \tfrac{1}{2}\delta^\top G(x)\delta + \varepsilon(\delta). \tag{9}$$

Here, $G(x)$ is the Generalized Gauss-Newton matrix (GGN) [21, 18], and $\varepsilon(\delta)$ denotes the error of the approximation.

The GGN is a Hessian-alternative which appears frequently in approximate 2nd-order optimization algorithms for neural networks. It is defined for losses of the form $\ell(x) = \nu(f(x))$, where $\nu(z)$ is convex in $z$. (Valid examples for $\nu$ include the standard softmax cross-entropy error and squared error.) It is given by

$$G(x) = J^\top H_\nu J,$$

where $J$ is the Jacobian of $f$, and $H_\nu$ is the Hessian of $\nu(z)$ with respect to $z$.

One interpretation of the GGN is that it is the Hessian of a modified loss $\hat\ell(x) = \nu(\hat f(x))$, where $\hat f$ is the local linear approximation of $f$ (given by $\hat f(x+\delta) = f(x) + J\delta$). For certain standard loss functions (including the ones we consider) it also corresponds to the Fisher Information Matrix associated with the network’s predictive distribution [18].

In the context of optimization, the local quadratic approximation induced by the GGN tends to work better than the actual 2nd-order Taylor series [e.g. 17], perhaps because it gives a better approximation to $\ell$ over non-trivial distances [18]. (It must necessarily be a worse approximation for very small values of $\delta$, since the 2nd-order Taylor series is clearly optimal in that respect.)
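As a small numerical illustration of the definition $G(x) = J^\top H_\nu J$, the sketch below evaluates the curvature term $\delta^\top G(x)\delta$ for softmax cross-entropy on a hypothetical linear model $z = Wx$, so that the Jacobian is simply $W$. The model, the weights in the usage note and all function names are assumptions. For a linear model the GGN coincides with the true Hessian, so the quadratic form should match a finite-difference second directional derivative of the loss.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]


def logits(W, x):
    """z = W x for a weight matrix given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def loss(W, x, label):
    """Softmax cross-entropy of a hypothetical linear model z = W x."""
    return -math.log(softmax(logits(W, x))[label])


def ggn_quadratic_form(W, x, delta):
    """delta^T G(x) delta with G = J^T H J, J = W, H = diag(p) - p p^T."""
    p = softmax(logits(W, x))
    dz = logits(W, delta)  # J delta, the change in logits due to delta
    mean = sum(pi * di for pi, di in zip(p, dz))
    return sum(pi * di * di for pi, di in zip(p, dz)) - mean * mean


def second_directional_derivative(W, x, label, delta, h=1e-3):
    """Central finite difference of the loss along the direction delta."""
    plus = loss(W, [xi + h * di for xi, di in zip(x, delta)], label)
    minus = loss(W, [xi - h * di for xi, di in zip(x, delta)], label)
    return (plus - 2.0 * loss(W, x, label) + minus) / (h * h)
```

The quadratic form is the variance of the logit change under the predictive distribution, so it is always non-negative, which is one reason the GGN is convenient as a curvature model.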

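### C.2 Bounds for common loss functions

Our basic strategy in proving the following results is to rearrange Eq (9) to establish the following bound on the curvature in terms of $g(\delta;x)$, which is defined in Eq (5) in the main text:

$$\begin{aligned} \tfrac{1}{2}\delta^\top G(x)\delta &= \ell(x+\delta) - \left(\ell(x)+\delta^\top\nabla_x\ell(x)\right) - \varepsilon(\delta) \\ &\le \left|\ell(x+\delta) - \left(\ell(x)+\delta^\top\nabla_x\ell(x)\right)\right| + |\varepsilon(\delta)| \\ &= g(\delta;x) + |\varepsilon(\delta)|. \end{aligned} \tag{10}$$

We then show that for both the squared error and softmax cross-entropy loss functions, one can bound $|\delta^\top\nabla_x\ell(x)|$ in terms of the curvature $\delta^\top G(x)\delta$, and by extension in terms of the local linearity measure $\gamma(\epsilon;x)$. Note that such a bound won’t exist for general loss functions.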

###### Proposition C.1.

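Suppose that $\nu$ is the squared error and $z = f(x)$ is the output of the neural network. Then for any perturbation vector $\delta\in B(\epsilon)$ we have

$$|\delta^\top\nabla_x\ell(x)| \le 2\sqrt{2\,\ell(x)\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right)},$$

where $\varepsilon(\delta)$ is the error of the local quadratic approximation defined in Equation 9.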

###### Proposition C.2.

Suppose that $\nu(z) = -y^\top\log(p(z))$ is the softmax cross-entropy error, where $y$ is a 1-hot target vector, and $p(z)$ is the vector of probabilities computed via the softmax function. Then for any perturbation vector $\delta\in B(\epsilon)$ we have

$$|\delta^\top\nabla_x\ell(x)| \le \sqrt{\frac{2}{y^\top p(z)}\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right)},$$

where $\varepsilon(\delta)$ is the error of the local quadratic approximation defined in Equation 9.

###### Remark.

We note that $y^\top p(z)$ is just the probability of the target label under the model. And so $1/(y^\top p(z))$ won’t be very big, provided that the model is properly classifying the data with some reasonable degree of certainty. (Indeed, for highly certain predictions it will be close to $1$.) Thus the upper bound given in Proposition C.2 should shrink at a reasonable rate as the regularizer $\gamma(\epsilon;x)$ does, provided that the error term $|\varepsilon(\delta)|$ is negligible.

## Appendix D Proofs

### D.1 Proof of Proposition C.1

###### Proof.

For convenience we will write , where we have defined .

We observe that for the squared error loss, and (because ).

Thus by Equation 10 we have

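$$\|J\delta\|^2 = \delta^\top J^\top J\delta = \delta^\top G(x)\delta \le 2\left(g(\delta;x)+|\varepsilon(\delta)|\right) \le 2\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right).$$

Using these facts, and applying the Cauchy-Schwarz inequality, we get

$$|\delta^\top\nabla_x\ell(x)|^2 = |-\delta^\top J^\top r|^2 = |(J\delta)^\top r|^2 \le \|J\delta\|^2\|r\|^2 \le 8\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right)\ell(x).$$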

Taking the square root of both sides yields the claim. ∎

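### D.2 Proof of Proposition C.2

###### Proof.

We begin by defining $r = y - p$, and observing that for the softmax cross-entropy loss, $\nabla_x\ell(x) = -J^\top r$, and $G(x) = J^\top H_\nu(z) J$ where

$$H_\nu(z) = \operatorname{diag}(p) - pp^\top.$$

Because the entries of $p$ are non-negative and sum to $1$ we can factor this as $H_\nu = CC^\top$, where $C = \operatorname{diag}(q) - pq^\top$ and $q$ is defined as the entry-wise square root of the vector $p$. To see that this is correct, note that

$$\begin{aligned} CC^\top &= \left(\operatorname{diag}(q)-pq^\top\right)\left(\operatorname{diag}(q)-pq^\top\right)^\top \\ &= \operatorname{diag}(q)^2 - \operatorname{diag}(q)\,qp^\top - pq^\top\operatorname{diag}(q) + pq^\top qp^\top \\ &= \operatorname{diag}(p) - pp^\top - pp^\top + pp^\top \\ &= H_\nu, \end{aligned}$$

where we have used the properties of $p$ and $q$, such as $\operatorname{diag}(q)\,q = p$, $q^\top q = 1$, etc.

Using this factorization we can rewrite the curvature term as

$$\delta^\top G(x)\delta = \delta^\top J^\top H_\nu(z) J\delta = \Delta z^\top H_\nu(z)\Delta z = \Delta z^\top CC^\top\Delta z = \|C^\top\Delta z\|^2,$$

where we have defined $\Delta z = J\delta$ (intuitively, this is “the change in $z$ due to $\delta$”). Thus by Equation 10 we have

$$\|C^\top\Delta z\|^2 \le 2\left(g(\delta;x)+|\varepsilon(\delta)|\right) \le 2\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right).$$

Let $v = \frac{1}{q^\top y}\,y$, which is well defined because $q$ is entry-wise positive (since $p$ must be), and $y$ is a one-hot vector. Using said properties of $q$ and $y$ we have that

$$Cv = \left(\operatorname{diag}(q)-pq^\top\right)\frac{1}{q^\top y}\,y = \frac{1}{q^\top y}\,q\odot y - \frac{1}{q^\top y}\,p\,(q^\top y) = y - p = r,$$

where $\odot$ denotes the entry-wise product.

It thus follows that

$$\delta^\top\nabla_x\ell(x) = -\delta^\top J^\top r = -\Delta z^\top r = -\Delta z^\top(Cv) = -(C^\top\Delta z)^\top v.$$

Using the above facts, and applying the Cauchy-Schwarz inequality, we arrive at

$$|\delta^\top\nabla_x\ell(x)|^2 = |(C^\top\Delta z)^\top v|^2 \le \|C^\top\Delta z\|^2\|v\|^2 \le 2\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right)\frac{1}{(q^\top y)^2}\|y\|^2 = \frac{2}{p^\top y}\left(\gamma(\epsilon;x)+|\varepsilon(\delta)|\right),$$

where we have used the facts that $(q^\top y)^2 = p^\top y$ and $\|y\|^2 = 1$. Taking the square root of both sides yields the claim. ∎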
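The two algebraic identities used in this proof, $CC^\top = \operatorname{diag}(p) - pp^\top$ and $Cv = y - p$, are easy to verify numerically. The sketch below does so for an arbitrary probability vector using plain Python lists; the function names are hypothetical and the check is an illustration rather than part of the proof.

```python
import math


def factor_softmax_hessian(p):
    """Return C = diag(q) - p q^T with q the entry-wise square root of p."""
    q = [math.sqrt(pi) for pi in p]
    n = len(p)
    return [
        [(q[i] if i == j else 0.0) - p[i] * q[j] for j in range(n)]
        for i in range(n)
    ]


def softmax_hessian(p):
    """Return H = diag(p) - p p^T."""
    n = len(p)
    return [
        [(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
        for i in range(n)
    ]


def times_transpose(C):
    """Compute C C^T for a square matrix given as a list of rows."""
    n = len(C)
    return [
        [sum(C[i][k] * C[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def residual_from_factor(p, label):
    """Compute C v with v = y / (q^T y) for the one-hot vector y of `label`."""
    C = factor_softmax_hessian(p)
    scale = 1.0 / math.sqrt(p[label])  # 1 / (q^T y)
    return [C[i][label] * scale for i in range(len(p))]
```

For `p = [0.2, 0.5, 0.3]` and target class 1, `residual_from_factor` returns the vector `y - p`, i.e. approximately `[-0.2, 0.5, -0.3]`.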


## Appendix F Experiments and Results: Supplementary

### F.1 Evaluation Setup

Optimization: Rather than using the sign of the gradient (FGSM) [8], we do the update steps on the adversarial perturbation using Adam [12] as the optimizer. We have consistently found that using Adam gives a stronger attack compared to the sign of the gradient. For Multi-Targeted (see Table 1), we run for 200 steps. For Untargeted and Random-Targeted, we use a step size schedule that drops to 0.01 after 100 steps and to 0.001 after 150 steps, for the last 50 steps. We find these to give us the strongest adversarial accuracy evaluation; the decrease in step size is especially helpful in cases where the gradient is obfuscated. Furthermore, we use 20 different random initializations (we term each a random restart) of the perturbation $\delta$ for going through the optimization procedure. We consider an attack successful if any of these 20 random restarts is successful. For CIFAR-10 we also show results for FGSM with 20 steps (FGSM-20), as this is a commonly used attack for evaluation.
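The evaluation protocol above (a decaying step size plus random restarts, with success declared if any restart succeeds) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the reported numbers: the classifier is a toy linear model, the update is a plain projected gradient step rather than Adam, and the initial step size `initial=0.1` in `step_size_at` is a placeholder value chosen for the example.

```python
import random


def margin(x, y, w):
    """Signed score y * <w, x> of a toy linear classifier; positive = correct."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


def step_size_at(step, initial=0.1):
    """Step-size schedule; `initial` is an assumed value for the first 100 steps."""
    if step < 100:
        return initial
    return 0.01 if step < 150 else 0.001


def attack_once(x, y, w, eps, rng, num_steps=200):
    """One restart: random start in B(eps), then projected steps that lower
    the margin. Returns True if the prediction is flipped."""
    delta = [rng.uniform(-eps, eps) for _ in x]
    for step in range(num_steps):
        eta = step_size_at(step)
        # For a linear model the gradient of -margin w.r.t. the input is -y * w.
        delta = [
            min(max(d - eta * y * wi, -eps), eps) for d, wi in zip(delta, w)
        ]
    return margin([xi + di for xi, di in zip(x, delta)], y, w) < 0.0


def attack_with_restarts(x, y, w, eps, num_restarts=20, seed=0):
    """The attack counts as successful if any restart flips the prediction."""
    rng = random.Random(seed)
    return any(attack_once(x, y, w, eps, rng) for _ in range(num_restarts))
```

On a linear model every restart converges to the same corner of the ball, so restarts add little here; they matter for non-linear loss surfaces, where different starting points can end in different local optima.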

### F.2 Training and Hyperparameters

##### CIFAR-10:

For all of the baselines we recreated and the LLR network we used the same schedule, which is inspired by TRADES [28]. For Wide-ResNet-28-8, we use initial learning rate 0.1 and we decrease after 100 and 105 epochs. We train for 110 epochs. For Wide-ResNet-40-8 we use initial learning rate 0.1 and we decrease after 100 and 105 epochs with a factor of 0.1. We train for 110 epochs. We used a momentum optimizer with momentum 0.9. For LLR, in addition to the hyper-parameters $\lambda$ and $\mu$ of Eq (8), a weight of 2 is placed on the nominal loss. We use $\ell_2$-regularization of 2e-4. The training is done with a batch size of 256. We also slowly increase the size of the perturbation radius over 15 epochs, starting from 0.0 until it gets to 8/255. For Wide-ResNet-28-8 and Wide-ResNet-40-8 we train with 10 and 15 steps of PGD respectively, using Adam with step size of 0.1.
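The CIFAR-10 schedules described above can be written as two small functions. This is a sketch under assumptions: the text only says the perturbation radius is increased "slowly", so the linear ramp is a guess, and the function names and the convention that a decay applies from the stated epoch onward are choices made for the example.

```python
def learning_rate(epoch, initial=0.1, boundaries=(100, 105), factor=0.1):
    """Piecewise-constant schedule: multiply by `factor` after each boundary."""
    lr = initial
    for boundary in boundaries:
        if epoch >= boundary:
            lr *= factor
    return lr


def perturbation_radius(epoch, target=8.0 / 255.0, ramp_epochs=15):
    """Ramp the radius from 0 to `target` over `ramp_epochs` epochs.
    The linear shape of the ramp is an assumption."""
    return target * min(epoch / ramp_epochs, 1.0)
```

With these defaults the learning rate is 0.1 until epoch 100, 0.01 until epoch 105 and 0.001 afterwards, and the radius reaches 8/255 at epoch 15 and stays there.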

##### ImageNet (4/255):

To train the LLR network the initial learning rate is 0.1. The decay schedule is similar to [26]: we decay by 0.1 after 35, 70 and 95 epochs. We train for 100 epochs. For LLR, in addition to the hyper-parameters $\lambda$ and $\mu$ of Eq (8), a weight of 3 is placed on the nominal loss. We use $\ell_2$-regularization of 1e-4. The training is done with a batch size of 512. We slowly increase the perturbation radius over 20 epochs from 0 to 4/255. We train with 2 steps of PGD using Adam and step size 0.1.

##### ImageNet (16/255):

To train the LLR network the initial learning rate is 0.1, and we decay by 0.1 after 17, 35 and 50 epochs; we train for 55 epochs. For LLR, in addition to the hyper-parameters $\lambda$ and $\mu$ of Eq (8), a weight of 3 is placed on the nominal loss. We use $\ell_2$-regularization of 1e-4. The training is done with a batch size of 512. We slowly increase the perturbation radius over 90 epochs from 0 to 16/255. We train with 10 steps of PGD using Adam with step size of 0.1.

##### Batch Normalization

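During training we use the local batch statistics at the nominal point. Suppose $\mu, \sigma$ denote the local batch statistics at every layer of the network for point $x$. Let us also denote by $\ell(x';y,\mu,\sigma)$ the loss function at a point $x'$ when we use batch statistics $\mu$ and $\sigma$. Then the loss we calculate at train time is the following

$$\ell(x;y,\mu,\sigma) + \mu\left|\delta_{LLR}^T\nabla_x\ell(x;y,\mu,\sigma)\right| + \lambda\max_{\delta\in B(\epsilon)} g(\delta;x,y,\mu,\sigma),$$

where $\delta_{LLR} = \operatorname*{argmax}_{\delta\in B(\epsilon)} g(\delta;x,y,\mu,\sigma)$ and

$$g(\delta;x,y,\mu,\sigma) = \left|\ell(x+\delta;y,\mu,\sigma) - \ell(x;y,\mu,\sigma) - \delta^T\nabla_x\ell(x;y,\mu,\sigma)\right|.$$

Here the coefficient $\mu$ multiplying the second term of the loss is the hyper-parameter from Eq (8), not the batch statistic.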

### F.3 Ablation Studies

We investigate the effects of adding the term $\mu|\delta_{LLR}^T\nabla_x\ell(x)|$ into LLR as shown in Eq. (8). The results are shown in Table 5. We can see that adding the term only yields minor improvements to the adversarial accuracy (49.38% vs 51.13%) for CIFAR-10, while we get a boost of almost 6% adversarial accuracy for ImageNet (41.30% vs 47.00%).

### F.4 Resistance to Gradient Obfuscation

In Fig 7 we show the adversarial perturbations for networks ADV-2 and LLR-2. We see that, in contrast to LLR-2, the adversarial perturbation for ADV-2 looks similar to random noise. When the adversarial perturbation resembles random noise, this is often a sign that the network is gradient obfuscated.

Furthermore, we show that the adversarial accuracy for LLR-2 is 44.50% as opposed to ADV-2 which is 0%. Surprisingly, even training with just 1 step of PGD for LLR (LLR-1) we obtain non-zero adversarial accuracy.

In Fig 7, we show the values of $\gamma(\epsilon;x)$ we obtain when we train with LLR or adversarial training (ADV). To find $\gamma(\epsilon;x)$ we maximize $g(\delta;x)$ by running 50 steps of PGD with step size 0.1. Here, we see that the values of $\gamma(\epsilon;x)$ for adversarial training with 20 steps of PGD are similar to those for LLR-2. In contrast, adversarial training (ADV-2) with just two steps of PGD gives much higher values of $\gamma(\epsilon;x)$.

### F.5 Adversarially Perturbed Images for 16/255

The perturbation radius 16/255 has become the norm [14, 26] to use to gauge how robust a network is on ImageNet. However, to be robust we need to make sure that the perturbation is sufficiently small such that it does not significantly affect our visual perception. We hypothesize that this perturbation radius is outside of this regime. Fig 8 shows that we can find examples which not only wipe out objects (the curbs) in the image, but can actually add faint images onto the white background. This significantly affects our visual perception of the image.