The LogBarrier adversarial attack: making effective use of decision boundary information

03/25/2019 ∙ by Chris Finlay, et al. ∙ McGill University 0

Adversarial attacks for image classification are small perturbations to images that are designed to cause misclassification by a model. Adversarial attacks formally correspond to an optimization problem: find a minimum norm image perturbation, constrained to cause misclassification. A number of effective attacks have been developed. However, to date, no gradient-based attacks have used best practices from the optimization literature to solve this constrained minimization problem. We design a new untargeted attack, based on these best practices, using the established logarithmic barrier method. On average, our attack distance is similar to or better than that of all state-of-the-art attacks on benchmark datasets (MNIST, CIFAR10, ImageNet-1K). In addition, our method performs significantly better on the most challenging images, those which normally require larger perturbations for misclassification. We employ the LogBarrier attack on several adversarially defended models, and show that it adversarially perturbs all images more efficiently than other attacks: the distance needed to perturb all images is significantly smaller with the LogBarrier attack than with other state-of-the-art attacks.

1. Introduction

Deep learning models have achieved impressive results in many areas of application. However, deep learning models remain vulnerable to adversarial attacks [SZS13]: small changes (imperceptible to the human eye) in the model input may lead to vastly different model predictions. In security-based applications, this vulnerability is of utmost concern. For example, traffic signs may be modified with small stickers to cause misclassification, causing, say, a stop sign to be treated as a speed limit sign [EEF18]. Facial recognition systems can be easily spoofed using colourful glasses [SBBR16].

This security flaw has led to an arms race in the research community, between those who develop defences to adversarial attacks, and those working to overcome these defences with stronger adversarial attack methods [WK18, RSL18]. Notably, as the community develops stronger adversarial attack methods, claims of model robustness to adversarial attack are often proved to be premature [CW17, ACW18].

There are two approaches to demonstrating a model is resistant to adversarial attacks. The first is theoretical, via a provable lower bound on the minimum adversarial distance necessary to cause misclassification [WK18, RSL18, KBD17]. Theoretical lower bounds are often pessimistic: the gap between the theoretical lower bound and adversarial examples generated by state-of-the-art attack algorithms can be large. Therefore, a second empirical approach is also used: an upper bound on the minimum adversarial distance is demonstrated through adversarial examples created by adversarial attacks [KGB16, MMS17, CW17, BRB18]. The motivation to design strong adversarial attacks is therefore twofold: on the one hand, to validate theoretical lower bounds on robustness; and on the other, to construct empirical upper bounds on the minimum adversarial distance. Ideally, the gap between the theoretical lower bound and the empirical upper bound should be small. As adversarial attacks become stronger, the gap narrows from above.

The process of finding an adversarial example with an adversarial attack is an optimization problem: find a small perturbation of the model input which causes misclassification. This optimization problem has been recast in various ways. Rather than directly enforcing misclassification, many adversarial attacks instead attempt to maximize the loss function. The Fast Gradient Sign Method (FGSM) was one of the first adversarial attacks to do so [SZS13], measuring perturbation size in the $\ell_\infty$ norm. Iterative versions of FGSM were soon developed, where perturbations were measured in either the $\ell_\infty$ or $\ell_2$ norms [KGB16, MMS17, ZCR18]. These iterative methods perform Projected Gradient Descent (PGD), maximizing the loss function subject to a constraint enforcing small perturbation in the appropriate norm. Other works have studied sparse adversarial attacks, as in [PMJ16]. Rather than maximizing the loss, Carlini and Wagner [CW17] developed a strong adversarial attack by forcing misclassification to a predetermined target class. If only the decision of the model is available (but not the loss or the model gradients), adversarial examples can still be found using gradient-free optimization techniques [BRB18].

In this paper, rather than using the loss as a proxy for misclassification, we design an adversarial attack that solves the adversarial optimization problem directly: minimize the size of the input perturbation subject to a misclassification constraint. Our method is gradient-based, but does not use the training loss function. The method is based on a sound, well-developed optimization technique, namely the logarithmic barrier method [NW06]. The logarithmic barrier is a simple and intuitive method designed specifically to enforce inequality constraints, which we leverage to enforce misclassification. We compare the LogBarrier attack against current benchmark adversarial attacks (using the Foolbox attack library [RBB17]), on several common datasets (MNIST [LC], CIFAR10 [Kri09], ImageNet-1K [DDS09]) and models. On average, we show that the LogBarrier attack is comparable to current state-of-the-art adversarial attacks. Moreover, we show that on challenging images (those that require larger perturbations for misclassification), the LogBarrier attack consistently outperforms other adversarial attacks. Indeed, we illustrate this point by attacking models trained to be adversarially robust, and show that the LogBarrier attack perturbs all images more efficiently than other attack methods: the LogBarrier attack is able to perturb all images using a much smaller perturbation size than that of other methods.

2. Background material

Adversarial examples arise in classification problems across multiple domains. The literature to date has been concerned primarily with adversarial examples in image classification: an adversarial image appears no different (or only slightly so) from an image correctly classified by a model but, despite this similarity, is misclassified.

We let $\mathcal{X}$ be the space of images. Typically, pixel values are scaled to be between 0 and 1, so that $\mathcal{X}$ is the unit box $[0,1]^M \subset \mathbb{R}^M$. We let $\mathcal{Y}$ be the space of labels. If the images can be one of $N$ classes, $\mathcal{Y}$ is usually a subset of $\mathbb{R}^N$. Often $\mathcal{Y}$ is the probability simplex, but not always. When it is, each element $y_i$ of a label $y$ corresponds to the probability that an image is of class $i$. Ground-truth labels are then one-hot vectors.

A trained model, with fixed model weights $w$, is a map $f(\cdot\,; w): \mathcal{X} \to \mathcal{Y}$. For brevity, in what follows we drop dependence on $w$. For an input image $x$, the model’s predicted classification is the $\arg\max$ of the model outputs. Given image-label pair $(x, y)$, let $c$ be the index of the correct label (the $\arg\max$ of $y$). A model is correct if $\arg\max_i f_i(x) = c$.

An adversarial image is a perturbation of the original image, $x + \delta$, such that the model misclassifies:

$$\arg\max_i f_i(x + \delta) \neq c. \tag{1}$$

The perturbation $\delta$ must be small in a certain sense: it must be small enough that a human can still correctly classify the perturbed image. There are various metrics for measuring the size of the perturbation. A common choice is the $\ell_\infty$ norm (the max-norm); others use the (Euclidean) $\ell_2$ norm. If perturbations must be sparse (for example, if the attacker can only modify a small portion of the total image), then the count of non-zero elements in $\delta$ may be used. Throughout this paper we let $m(\delta)$ be a generic metric on the size of the perturbation $\delta$. Typically $m(\delta)$ is a specific $\ell_p$ norm, $m(\delta) = \|\delta\|_p$, such as the Euclidean or max norms. Thus the problem of finding an adversarial image may be cast as an optimization problem: minimize the size of the perturbation subject to model misclassification.
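As a small illustration of the size measures discussed above, the following sketch computes the max-norm, the Euclidean norm, and the non-zero count of a perturbation. The function name `perturbation_sizes` and the example array are hypothetical and are not taken from the paper.

```python
import numpy as np

def perturbation_sizes(delta):
    """Measure a perturbation three ways: max-norm, Euclidean norm, non-zero count."""
    flat = np.asarray(delta, dtype=float).ravel()
    return {
        "linf": float(np.max(np.abs(flat))),      # largest single-pixel change
        "l2": float(np.sqrt(np.sum(flat ** 2))),  # Euclidean length
        "l0": int(np.count_nonzero(flat)),        # number of modified pixels
    }

# Hypothetical 2x2 perturbation that touches two pixels.
delta = np.array([[0.0, 0.3], [-0.4, 0.0]])
sizes = perturbation_sizes(delta)
```

The same perturbation can look small under one measure and large under another, which is why the choice of measure has to be fixed before attacks are compared.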

The misclassification constraint is difficult to enforce, so a popular alternative is to introduce a loss function $\mathcal{L}$. For example, $\mathcal{L}$ could be the loss function used during model training. In this case the loss measures the ‘correctness’ of the model at an image $x$. If the loss is large at a perturbed image $x + \delta$, then it is hoped that the image is also misclassified. The loss function is then used as a proxy for misclassification, which gives rise to the following indirect method for finding adversarial examples:

$$\max_{\delta} \; \mathcal{L}(x + \delta) \quad \text{subject to} \quad m(\delta) \leq \varepsilon, \tag{2}$$

that is, maximize the loss subject to perturbations being smaller than a certain threshold $\varepsilon$. The optimization approach taken by (2) is by far the most popular method for finding adversarial examples. In one of the first papers on this topic, Szegedy et al. [SZS13] proposed the Fast Gradient Sign Method (FGSM), where $m(\delta)$ is the $\ell_\infty$ norm of the perturbation $\delta$, and the solution to (2) is approximated by taking one step in the signed gradient direction. An iterative version with multiple steps, Iterative FGSM (IFGSM), was proposed in [KGB16], and remains the method of choice for adversarial attacks measured in $\ell_\infty$. When perturbations are measured in $\ell_2$, (2) is solved with Projected Gradient Descent (PGD) [MMS17].
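The loss-maximization approach can be sketched on a toy problem. The code below is a minimal illustration, not the Foolbox implementation used in the experiments: it assumes a hypothetical linear softmax classifier with logits `W @ x + b`, a cross-entropy loss, and an even split of the budget `eps` across the steps. Setting `steps=1` gives a single signed-gradient step in the style of FGSM.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_gradient(W, b, x, c):
    """Gradient of the cross-entropy loss w.r.t. the input, for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[c] -= 1.0                 # softmax output minus one-hot label
    return W.T @ p

def ifgsm(W, b, x, c, eps, steps=10):
    """Iterated signed-gradient ascent on the loss inside an l-infinity ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + (eps / steps) * np.sign(loss_gradient(W, b, x_adv, c))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # keep ||delta||_inf <= eps
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixels in [0, 1]
    return x_adv

# Hypothetical two-pixel, two-class example.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.6, 0.4])        # classified as class 0
x_adv = ifgsm(W, b, x, c=0, eps=0.2)
```

Note that the threshold `eps` is fixed in advance here; the attack does not search for the smallest perturbation that flips the label, which is the gap the direct formulation aims to close.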

Fewer works have studied the adversarial optimization problem directly, i.e., without a loss function. In a seminal work, Carlini and Wagner [CW17] developed a targeted attack, in which the adversarial distance is minimized subject to a targeted misclassification. In a targeted attack, not just any misclassification will do: the adversarial perturbation must induce misclassification to a pre-specified target class. The Carlini-Wagner attack (CW) incorporates the targeted misclassification constraint as a penalty term into the objective function. The CW attack was able to overcome many adversarial defence methods that had been thought to be effective, and was an impetus for the adversarial research community’s search for rigorous, theoretical guarantees of adversarial robustness.

There is also interest in gradient-free methods for finding adversarial examples. In this scenario, the attacker only has access to the classification of the model, but not the model itself (nor the model’s gradients). In [BRB18], Brendel et al. directly minimize the adversarial distance while enforcing misclassification using a gradient-free method. Their Boundary attack iteratively alternates between minimizing the perturbation size, and projecting the perturbation onto the classification boundary. The projection step is approximated by locally sampling the model decision near the classification boundary.

3. The LogBarrier Attack

We tackle the problem of finding (untargeted) adversarial examples by directly solving the following optimization problem,

$$\min_{\delta} \; m(\delta) \quad \text{subject to} \quad \arg\max_i f_i(x + \delta) \neq c, \tag{3}$$

that is, minimize the adversarial distance subject to misclassification. We use the logarithmic barrier method [NW06] to enforce misclassification, as follows. We are given an image-label pair $(x, y)$ and correct label $c = \arg\max_i y_i$. Misclassification at an image $x + \delta$ occurs if there is at least one index of the model’s prediction with greater value than the prediction of the correct index:

$$\max_{i \neq c} f_i(x + \delta) > f_c(x + \delta). \tag{4}$$

This is a necessary and sufficient condition for misclassification. Thus, we rewrite (3):

$$\min_{\delta} \; m(\delta) \quad \text{subject to} \quad \max_{i \neq c} f_i(x + \delta) - f_c(x + \delta) > 0. \tag{5}$$

The barrier method is a standard tool in optimization for solving problems such as (5) with inequality constraints. A complete discussion of the method can be found in [NW06]. In the barrier method, inequality constraints are incorporated into the objective function via a penalty term, which is infinite if a constraint is violated. If a constraint is far from being active, then the penalty term should be small. The negative logarithm is an ideal choice:

$$\min_{\delta} \; m(\delta) - \lambda \log\left(f_{\max} - f_c\right), \tag{6}$$

where we denote $f_{\max} := \max_{i \neq c} f_i(x + \delta)$ and $f_c := f_c(x + \delta)$, and $\lambda > 0$ is the penalty size. If $f_{\max}$ is much larger than $f_c$, the logarithmic barrier term is small. However, as this gap shrinks, the penalty term approaches infinity. Thus the penalty acts as a barrier, forcing an optimization algorithm to search for solutions where the constraint is inactive. If (6) is solved iteratively with smaller and smaller values of $\lambda$, in the limit as $\lambda \to 0$, the solution to the original problem (5) is recovered. (This argument can be made formal if desired, using $\Gamma$-convergence [Bra02].) See Figure 1, where the barrier function is plotted with decreasing values of $\lambda$. In the limit as $\lambda \to 0$, the barrier becomes 0 if the constraint is satisfied, and $\infty$ otherwise.
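To make the behaviour of the barrier term concrete, here is a minimal sketch that evaluates a barrier-penalized objective for a given vector of model outputs. It assumes the squared Euclidean norm as the size measure; the function name `barrier_objective` and the example numbers are hypothetical.

```python
import numpy as np

def barrier_objective(outputs, c, delta, lam):
    """Squared l2 size of delta plus the log barrier on the misclassification gap."""
    outputs = np.asarray(outputs, dtype=float)
    f_c = outputs[c]
    f_max = np.max(np.delete(outputs, c))   # strongest incorrect class
    gap = f_max - f_c
    if gap <= 0:
        return np.inf                        # constraint violated: not misclassified
    return float(np.sum(np.square(delta)) - lam * np.log(gap))

outputs = [0.1, 0.7, 0.2]                    # hypothetical model probabilities
delta = np.zeros(4)
feasible = barrier_objective(outputs, c=0, delta=delta, lam=1.0)
infeasible = barrier_objective(outputs, c=1, delta=delta, lam=1.0)
near_boundary = [barrier_objective([0.495, 0.505], 0, delta, lam)
                 for lam in (1.0, 0.1, 0.01)]
```

In `near_boundary` the gap is only 0.01, so the barrier term is large for `lam = 1.0` and fades as `lam` shrinks, which mirrors the limiting argument in the text.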

3.1. Algorithm description

Figure 1. The logarithmic barrier function $-\lambda \log(\cdot)$, defined for positive arguments. As $\lambda$ decreases, the barrier becomes steeper, mimicking a hard constraint.

We now give a precise description of our implementation of the log barrier method for generating adversarial images. The constraint $f_{\max} - f_c > 0$ can be viewed as defining a feasible set. Thus, the algorithm begins by finding an initial feasible image: the original image must be perturbed so that it is misclassified (not necessarily close to the original). There are several ways to find a misclassified image. A simple approach would be to take another natural image with a different label. However, we have found in practice that closer initial images are generated by randomly perturbing the original image with increasing levels of noise (e.g. Standard Normal or Bernoulli) until it is misclassified. After each random perturbation, the image is projected back onto the set of images in the box, via the projection $\mathcal{P}$ (elementwise clipping to $[0,1]$). This process is briefly described in Algorithm 1. Note that if the original image is already misclassified, no random perturbation is performed, since the original image is already adversarial.

  Input: image , model , , step-size , .
  Initialize: or
  for  to  do
     if  misclassified then
        Exit for-loop
     else
        Sample from
        
     end if
  end for
Algorithm 1 LogBarrier: Initialization
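The initialization idea can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: the noise is Gaussian, its scale grows linearly with the iteration count, it is always added to the original image, and the names `find_initial_adversarial`, `step` and `max_iters` are hypothetical.

```python
import numpy as np

def find_initial_adversarial(x, is_misclassified, step=0.05, max_iters=1000, seed=0):
    """Add growing Gaussian noise to x until the clipped image is misclassified."""
    rng = np.random.default_rng(seed)
    u = x.copy()
    for k in range(1, max_iters + 1):
        if is_misclassified(u):
            return u                                  # feasible starting point found
        noise = rng.standard_normal(x.shape)
        u = np.clip(x + k * step * noise, 0.0, 1.0)   # project back onto the box
    return u if is_misclassified(u) else None

# Hypothetical linear classifier; the clean image x belongs to class 0.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = np.array([0.6, 0.4])
is_misclassified = lambda u: int(np.argmax(W @ u)) != 0
u0 = find_initial_adversarial(x, is_misclassified)
```

If the clean image is already misclassified, the function returns it unchanged on the first check, matching the note in the text.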

After an initial perturbation is found, we solve (6) for a fixed $\lambda$. Various optimization algorithms may be used to solve (6). For small- to medium-scale problems, variants of Newton’s method are typically preferred. However, due to computational constraints, we chose to use gradient descent. After each gradient descent step, we check to ensure that the updated adversarial image remains in the box. If not, it is projected back into the set of images with the projection $\mathcal{P}$.

It is possible that a gradient descent step moves the adversarial image so that the image is correctly classified by the model. If this occurs, we simply backtrack along the line between the current iterate and the previous iterate, until we regain feasibility. To illustrate the backtracking procedure, let $u^{(k)}$ be the previous iterate, and $\tilde{u}^{(k+1)}$ be a candidate adversarial image which is now correctly classified. We continue backtracking the next iterate via

$$\tilde{u}^{(k+1)} \leftarrow \gamma\, \tilde{u}^{(k+1)} + (1 - \gamma)\, u^{(k)} \tag{7}$$

until the iterate is misclassified. The hyper-parameter $\gamma \in (0, 1)$ is a backtracking parameter. The accumulation point of the above sequence is $u^{(k)}$. As a result, this process is guaranteed to terminate, since the previous iterate is itself misclassified. This backtracking procedure is sometimes necessary when iterates are very close to the decision boundary. If the iterate is very close to the decision boundary, then the gradient of the log barrier term is very large, and dominates the update step. Since the constraint set is not necessarily convex or even connected, it is possible that the iterate could be sent far from the previous iterate without maintaining misclassification. We rarely experience this phenomenon in practice, but include the backtracking step as a safety. An alternate approach (which we did not implement), more aligned with traditional optimization techniques, would be instead to use a dynamic step size rule such as the Armijo-Goldstein condition [Arm66].
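The backtracking step can be sketched in a few lines. The function name `backtrack`, the default `gamma`, and the safety cap `max_steps` are assumptions made for this illustration; the misclassification check is passed in as a callable.

```python
import numpy as np

def backtrack(u_prev, u_cand, is_misclassified, gamma=0.5, max_steps=50):
    """Pull a candidate back toward the previous (misclassified) iterate."""
    for _ in range(max_steps):
        if is_misclassified(u_cand):
            return u_cand
        # Convex combination: each pass moves the candidate closer to u_prev.
        u_cand = gamma * u_cand + (1.0 - gamma) * u_prev
    return u_prev                      # fall back to the known feasible point

# Hypothetical 1-D example: points with coordinate > 1 count as misclassified.
is_misclassified = lambda u: u[0] > 1.0
result = backtrack(np.array([2.0]), np.array([0.0]), is_misclassified)
```

Starting from the candidate 0.0, the first pass gives 1.0 (still correctly classified) and the second gives 1.5, which is accepted.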

Figure 2. The central path taken by the LogBarrier attack. Dashed lines represent level sets of the logarithmic barrier function. As $\lambda$ decreases, iterates approach the decision boundary.
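  Input: original image $x$, initial misclassified image $u^{(0)}$, model $f$, distance measure $m$
  Hyperparameters: backtrack factor $\gamma \in (0,1)$; initial penalty size $\lambda_0$; step size $h$; shrink factor $\beta \in (0,1)$; termination threshold $\varepsilon$; and maximum iterations $J$ (outer loop) and $K$ (inner loop).
  for $j = 0$ to $J$ do
     $\lambda_j = \lambda_0 \beta^j$
     for $k = 0$ to $K$ do
        $u^{(k+1)} = \mathcal{P}\left(u^{(k)} - h \nabla \left[ m(u^{(k)} - x) + \lambda_j \phi(u^{(k)}) \right]\right)$
        while $u^{(k+1)}$ not misclassified do
           $u^{(k+1)} \leftarrow \gamma u^{(k+1)} + (1 - \gamma) u^{(k)}$
        end while
        if $\|u^{(k+1)} - u^{(k)}\| \leq \varepsilon$ then
           break
        end if
     end for
  end for
Algorithm 2 LogBarrier attack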

The gradient descent algorithm comprises a series of iterates in an inner loop. Recall that as $\lambda \to 0$, the log barrier problem approaches the original problem (5). Thus, we shrink $\lambda$ by a certain factor and repeat the procedure, iterating in a series of outer loops (initializing each with the previous iterate). As $\lambda$ shrinks, the solutions to (6) approach the decision boundary. In each inner loop, if the iterates move less than some threshold value $\varepsilon$, we move on to the next outer loop. The path taken by the iterates of the outer loop is called the central path, illustrated in Figure 2.

The LogBarrier attack pseudocode is presented in Algorithm 2. For brevity we write the log barrier $\phi(u) := -\log\left(\max_{i \neq c} f_i(u) - f_c(u)\right)$. We remark that the LogBarrier attack can be improved by running the method several times, with different random initializations (although we do not implement this here).
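The overall loop structure (inner gradient descent, projection, backtracking, and an outer loop that shrinks the penalty) can be sketched on a toy problem. This is an illustrative sketch under several assumptions, not the authors' implementation: the model is a hypothetical linear softmax classifier with an analytic gradient, the size measure is the squared Euclidean norm, and all hyper-parameter values and names (`logbarrier_attack`, `gap_and_grad`, `shrink`, `outer`, `inner`) are chosen for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gap_and_grad(W, b, u, c):
    """Return g = f_max - f_c and its gradient w.r.t. u for f = softmax(W @ u + b)."""
    p = softmax(W @ u + b)
    wrong = np.delete(np.arange(len(p)), c)
    j = wrong[np.argmax(p[wrong])]          # strongest incorrect class
    mean_row = p @ W                        # sum_k p_k W_k
    grad = p[j] * (W[j] - mean_row) - p[c] * (W[c] - mean_row)
    return p[j] - p[c], grad

def logbarrier_attack(W, b, x, c, u0, lam=0.1, shrink=0.5, step=0.01,
                      gamma=0.5, tol=1e-6, outer=6, inner=200):
    u = u0.copy()
    for _ in range(outer):
        for _ in range(inner):
            g, g_grad = gap_and_grad(W, b, u, c)
            # Gradient of ||u - x||^2 - lam * log(g).
            grad = 2.0 * (u - x) - lam * g_grad / g
            cand = np.clip(u - step * grad, 0.0, 1.0)   # project onto the box
            for _ in range(60):                         # bounded backtracking
                if gap_and_grad(W, b, cand, c)[0] > 0:
                    break
                cand = gamma * cand + (1.0 - gamma) * u
            else:
                cand = u                                # keep the feasible iterate
            moved = np.linalg.norm(cand - u)
            u = cand
            if moved < tol:
                break
        lam *= shrink                                   # tighten the barrier
    return u

W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.6, 0.4])      # correctly classified as class 0
u0 = np.array([0.0, 1.0])     # misclassified starting point
u = logbarrier_attack(W, b, x, c=0, u0=u0)
```

In this toy setting the decision boundary is the line where both pixels are equal, so the closest misclassified points lie just beyond `[0.5, 0.5]`; the returned iterate stays on the misclassified side and moves much closer to `x` than the starting point.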

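The literature on adversarial perturbations primarily focuses on perturbations measured in the $\ell_2$ and $\ell_\infty$ norms. For perturbations measured in the $\ell_2$ norm, we set the distance measure to be the squared Euclidean norm, $m(\delta) = \|\delta\|_2^2$. When perturbations are measured in the $\ell_\infty$ norm, we do not use the max-norm directly as a measure, because the $\ell_\infty$ norm is non-smooth with sparse subgradients. Instead, we use the following approximation of the $\ell_\infty$ norm [LZH14],

$$\|\delta\|_\infty \approx \frac{\sum_i |\delta_i| \exp(\alpha |\delta_i|)}{\sum_i \exp(\alpha |\delta_i|)},$$

where $\alpha > 0$. As $\alpha \to \infty$ the $\ell_\infty$ norm is recovered.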
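One smooth surrogate for the max-norm is a softmax-weighted average of the absolute entries. The sketch below assumes that form; the function name `smooth_linf` and the sharpness parameter `alpha` are labels chosen for this example.

```python
import numpy as np

def smooth_linf(delta, alpha=50.0):
    """Softmax-weighted average of |delta_i|; tends to max|delta_i| as alpha grows."""
    a = np.abs(np.asarray(delta, dtype=float)).ravel()
    w = np.exp(alpha * (a - a.max()))   # shift by the max for numerical stability
    return float(np.sum(a * w) / np.sum(w))

delta = np.array([0.1, -0.3, 0.2])
approx = smooth_linf(delta, alpha=50.0)
sharp = smooth_linf(delta, alpha=1000.0)
flat = smooth_linf(delta, alpha=0.0)    # degenerates to the mean of |delta_i|
```

Unlike the exact max-norm, whose subgradient touches only the largest entry, this surrogate passes gradient to every entry, with weights concentrating on the largest ones as `alpha` increases.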

Algorithm hyper-parameters

Like many optimization routines, the logarithmic barrier method has several hyper-parameters. However, because our implementation is parallelized, we have found that the tuning process is relatively quick. For the attack, our default parameters are , and . For , we set with and ; the rest are the same as in the case.

For the initialization procedure, we have and . If attacking in , we initialize using the Standard Normal distribution. Else, for , we use the Bernoulli initialization with .

Top5 misclassification

The LogBarrier attack may be generalized to enforce Top5 misclassification as well. In this case, the misclassification constraint is that $f_{(5)}(x + \delta) > f_c(x + \delta)$, where now $(i)$ is the index of sorted model outputs. (In other words, $f_{(1)}$ is the largest model output, $f_{(2)}$ is the second-largest model output, and so forth.) We then set the barrier function to be $-\log\left(f_{(5)} - f_c\right)$. In this scenario, the LogBarrier attack is initialized with an image that is not classified in the Top5.
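A Top5 barrier along these lines can be sketched by comparing the correct-class output against the fifth-largest output. The function name `top5_barrier` and the example output vector are hypothetical.

```python
import numpy as np

def top5_barrier(outputs, c):
    """Barrier on the gap between the fifth-largest output and the correct class."""
    outputs = np.asarray(outputs, dtype=float)
    fifth = np.sort(outputs)[-5]            # fifth-largest model output
    gap = fifth - outputs[c]
    return float(-np.log(gap)) if gap > 0 else np.inf

outputs = [0.05, 0.3, 0.2, 0.15, 0.12, 0.1, 0.08]   # hypothetical 7-class outputs
outside_top5 = top5_barrier(outputs, c=0)   # correct class ranked below fifth
inside_top5 = top5_barrier(outputs, c=1)    # correct class ranked first
```

The barrier is finite only when the correct class has dropped out of the five largest outputs, which is the Top5 notion of misclassification.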

4. Experimental results

We compare the LogBarrier attack with current state-of-the-art adversarial attacks on three benchmark datasets: MNIST [LC], CIFAR10 [Kri09], and ImageNet-1K [DDS09]. On MNIST and CIFAR10, we attack 1000 randomly chosen images; on ImageNet-1K we attack 500 randomly selected images, due to computational constraints. On ImageNet-1K, we use the Top5 version of the LogBarrier attack.

All other attack methods are implemented using the adversarial attack library Foolbox [RBB17]. For adversarial attacks measured in $\ell_2$, we compare the LogBarrier attack against Projected Gradient Descent (PGD) [MMS17], the Carlini-Wagner attack (CW) [CW17], and the Boundary attack (BA) [BRB18]. These three attacks are all very strong, and consistently perform well in adversarial attack competitions. When measured in $\ell_\infty$, we compare against IFGSM [KGB16], the current state-of-the-art. We leave Foolbox hyper-parameters at their defaults, except the number of iterations in the Boundary attack, which we set to a maximum of 5000 iterations.

4.1. Undefended networks

                     MNIST    CIFAR10              ImageNet-1K
                              AllCNN   ResNeXt34
  Perturbation size  2.3
  LogBarrier         99.10    98.70    99.90       98.40
  CW                 98.50    97.30    90.40       74.86
  PGD                52.58    86.60    59.80       90.00
  BA                 97.20    98.70    99.60       48.80

Table 1. Percent misclassification of the networks at a specified perturbation size, for attacks measured in $\ell_2$. Because we are measuring the strength of adversarial attacks, at a given adversarial distance, a higher percentage misclassified is better.

                     MNIST    CIFAR10              ImageNet-1K
                              AllCNN   ResNeXt34
  Perturbation size  0.3
  LogBarrier         94.80    100      98.70       95.20
  IFGSM              73.40    93.1     75.80       99.60

Table 2. Percent misclassification of the networks at a specified perturbation size, for attacks measured in $\ell_\infty$. Higher percentage misclassified is better.
MNIST CIFAR10 ImageNet-1K
AllCNN ResNeXt34
LogBarrier 1.29
CW 1.27 1.59
PGD 2.54 2.53 1.15
BA 1.41 1.55 3.31
Table 3. Adversarial attack perturbation statistics in the $\ell_2$ norm. We report the mean and variance of the adversarial distance on a subsample of the test dataset. Lower values are better.

MNIST CIFAR10 ImageNet-1K
AllCNN ResNeXt34
LogBarrier
IFGSM
Table 4. Adversarial attack perturbation statistics in the $\ell_\infty$ norm. We report the mean and variance of the adversarial attack distance for each method on a subsample of the test dataset. Lower values are better.

We first study the LogBarrier attack on networks that have not been trained to be adversarially robust. For MNIST, we use the network described in [CW17, PMJ16]. On CIFAR10, we consider two networks: AllCNN [SDBR14], a shallow convolutional network; and a ResNeXt34 (2x32) [XGD17], a much deeper residual network. Finally, for ImageNet-1K, we use a pre-trained ResNet50 [HZRS16] available for download on the PyTorch website.

Tables 1 and 2 report the percentage misclassified, for each attack at a fixed perturbation size. A strong attack should have a high misclassification rate. In the tables, the perturbation size is chosen to agree with attack thresholds commonly reported in the adversarial literature. Measured in Euclidean norm, we see that the LogBarrier attack is the strongest on all datasets and models. Measured in the max-norm, the LogBarrier outperforms IFGSM on all datasets and models, except on ImageNet-1K where the difference is slight.

We also report the mean and variance of the adversarial attack distances, measured in $\ell_2$ and $\ell_\infty$, in Tables 3 and 4 respectively. A strong adversarial attack should have a small mean adversarial distance, and a small variance. Small variance is necessary to ensure precision of the attack method. A strong attack method should be able to consistently find close adversarial examples. Table 3 demonstrates that, measured in $\ell_2$, the LogBarrier attack is either the first ranked attack, or a close second. When measured in $\ell_\infty$, the LogBarrier attack significantly outperforms IFGSM on all datasets and models, except ImageNet-1K.
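The summary statistics used in this section can be computed from a list of per-image adversarial distances. The sketch below is illustrative only: the function name `attack_summary` and the distances are made up, and the 90% quantile is included because the later tables on defended networks report it.

```python
import numpy as np

def attack_summary(distances, threshold, q=0.9):
    """Summarize per-image adversarial distances found by an attack."""
    d = np.asarray(distances, dtype=float)
    return {
        # An image counts as misclassified at `threshold` if the attack
        # found an adversarial example within that distance.
        "percent_misclassified": 100.0 * float(np.mean(d <= threshold)),
        "mean": float(d.mean()),
        "variance": float(d.var()),
        "quantile": float(np.quantile(d, q)),   # distance needed for a fraction q
    }

# Hypothetical distances for five images.
summary = attack_summary([0.1, 0.2, 0.3, 0.4, 0.5], threshold=0.3)
```

A stronger attack shifts these distances downward, which raises the percentage misclassified at a fixed threshold and lowers the mean and the quantile.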

(a) $\ell_\infty$ attacks on MNIST
(b) $\ell_\infty$ attacks on CIFAR10
Figure 3. Adversarial images for $\ell_\infty$ perturbations, generated by the LogBarrier and IFGSM adversarial attacks, compared against the original clean image. Where IFGSM has difficulties finding adversarial images, the LogBarrier method succeeds: LogBarrier adversarial images are visibly less distorted than IFGSM adversarial images.

For illustration, we show examples of adversarial images from the IFGSM and LogBarrier attacks in Figure 3. On images where IFGSM requires a large distance to adversarially perturb, the LogBarrier attack produces visibly less distorted images.

(a) MNIST
(b) CIFAR10
Figure 4. Overlay of attack curves, measured in $\ell_\infty$, on (a) MNIST and (b) CIFAR10 networks. Two types of networks are compared: an undefended network, and a defended network (denoted (D)), trained using the same architecture as the undefended network with adversarial training. The LogBarrier attack requires a smaller adversarial distance to attack all images, compared to IFGSM.

4.2. Defended networks

In this section we turn to attacking adversarially defended networks. We first consider two defence strategies: gradient obfuscation [ACW18], and multi-step adversarial training as described in Madry et al. [MMS17]. We study these two strategies on the MNIST and ResNeXt34 networks used in Section 4.1. We limit ourselves to studying defence methods for attacks in the $\ell_\infty$ norm. Attacks are performed on the same 1000 randomly selected images as in the previous section. Finally, we also test our attack on an MNIST model trained with Convex Adversarial Polytope [WK18] training, the current state-of-the-art defence method on MNIST.

Gradient Obfuscation

Although discredited as a defence method [ACW18], gradient obfuscation is a hurdle any newly proposed adversarial attack method must be able to surmount. We implement gradient obfuscation by increasing the temperature on the softmax function computing model probabilities from model logits. As the softmax temperature increases, the size of the gradients of the model probabilities approaches zero, because the model probabilities approach one-hot vectors. Although the decision boundary of the model does not change, many adversarial attack algorithms have difficulty generating adversarial examples when model gradients are small.
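The vanishing-gradient effect can be reproduced on a toy model. The sketch below assumes the convention in which the logits are multiplied by a factor `scale` before the softmax, so that a larger factor pushes the probabilities toward one-hot vectors; the linear model and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def prob_gradient_norm(W, x, c, scale):
    """Norm of the input gradient of p_c, where p = softmax(scale * W @ x)."""
    p = softmax(scale * (W @ x))
    grad = scale * p[c] * (W[c] - p @ W)    # chain rule through the softmax
    return float(np.linalg.norm(grad))

W = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = np.array([0.9, 0.1])                    # confidently class 0
plain = prob_gradient_norm(W, x, c=0, scale=1.0)
obfuscated = prob_gradient_norm(W, x, c=0, scale=50.0)
same_decision = np.argmax(softmax(W @ x)) == np.argmax(softmax(50.0 * (W @ x)))
```

The predicted class is identical in both cases, yet the gradient available to a loss-based attack collapses once the probabilities saturate.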

In Tables 5 and 6 we show that the LogBarrier attack easily overcomes gradient obfuscation, on both CIFAR10 and MNIST models. The reason that the LogBarrier method is able to overcome gradient obfuscation is simple: away from the decision boundary, the logarithmic barrier term is not active (indeed, it is nearly zero). Thus the LogBarrier algorithm focuses on minimizing the adversarial distance, until it is very close to the decision boundary, at which point the barrier term activates. In contrast, because IFGSM is a local method, if model gradients are small, it has a difficult time climbing the loss landscape, and is not able to generate adversarial images.

               Undefended                  Obfuscated                  Adversarial training
               % (a)   % (b)   90% dist.   % (a)   % (b)   90% dist.   % (a)   % (b)   90% dist.
  LogBarrier   15.70   94.80   0.22        18.30   99.80               2.90    95.40   0.29
  IFGSM        12.40   62.5    0.46        8.60    32.90   NA          3.00    53.80   0.65

Table 5. Defence strategies on MNIST. Columns (a) and (b) report the percentage misclassified at two adversarial magnitudes, with (a) the smaller magnitude and (b) equal to 0.3; higher is better. We also report the attack magnitude needed to perturb 90% of the images (the 90% quantile of attack distances). ’NA’ indicates that the attack failed.

               Undefended                  Obfuscated                  Adversarial training
               % (a)   % (b)   90% dist.   % (a)   % (b)   90% dist.   % (a)   % (b)   90% dist.
  LogBarrier   98.40   98.70               47.60   54.40               23.40   48.10
  IFGSM        58.30   75.80               36.90   43.90   NA          31.60   54.90

Table 6. Defence strategies on ResNeXt34 on the CIFAR10 dataset. Columns (a) and (b) report the percentage misclassified at two adversarial magnitudes, with (a) the smaller of the two. We also report the magnitude required to perturb 90% of test images. If the adversarial attack was unsuccessful, we report NA.

Adversarial Training

Adversarial training is a popular method for defending against adversarial attacks. We test the LogBarrier attack on networks trained with multi-step adversarial training in the $\ell_\infty$ norm, as presented in Madry et al. [MMS17]. Our results are shown in Tables 5 and 6. We also plot attack curves of the LogBarrier and IFGSM attacks on defended and undefended models in Figures 4(a) and 4(b), for MNIST and CIFAR10 respectively.

On MNIST, we did not observe a reduction in test accuracy on clean images with adversarially trained models compared to undefended models. As expected, adversarial training hinders both LogBarrier and IFGSM from finding adversarial images at very small distances. However, we see that the LogBarrier attack is able to attack all images with nearly the same distance in both the defended and undefended models. In contrast, IFGSM requires a very large adversarial distance to attack all images on the defended model, as shown in Figure 4(a). That is, adversarial training does not significantly increase the empirical distance required to perturb all images, when the LogBarrier attack is used. The point is illustrated in Table 5, where we report the distance required to perturb 90% of all images. The LogBarrier attack requires an adversarial distance of 0.22 on the undefended MNIST model, and 0.29 on the defended MNIST model, to perturb 90% of all images. In contrast, IFGSM requires a distance of 0.46 on the undefended model, but 0.65 on the defended model.

On CIFAR10, we observe the same behaviour, although the phenomenon is less pronounced. As shown in Table 6 and Figure 4(b), the LogBarrier attack requires a smaller adversarial distance to perturb all images than IFGSM. Notably, the LogBarrier attack on the defended network is able to attack all images with a smaller adversarial distance than even IFGSM on the undefended network.

Against the Convex Adversarial Polytope

Finally, we use the LogBarrier attack on a provable defence strategy, the Convex Adversarial Polytope [WK18]. The Convex Adversarial Polytope is a method for training a model to guarantee that no more than a certain percentage of images may be attacked at a given adversarial distance. We chose to attack the defended MNIST network in [WK18], which is guaranteed to have no more than 5.82% misclassification at perturbation size $\|\delta\|_\infty = 0.1$. We validated this theoretical guarantee with both the LogBarrier attack and IFGSM, and found that both methods were unable to perturb more than 3% of test images at distance 0.1.

5. Discussion

We have presented a new adversarial attack that uses a traditional method from the optimization literature, namely the logarithmic barrier method. The LogBarrier attack is effective in both the $\ell_2$ and $\ell_\infty$ norms. The LogBarrier attack directly solves the optimization problem posed by the very definition of adversarial images; i.e., find an image close to an original image, while being misclassified by a network. This is in contrast to many other adversarial attack methods (such as PGD or IFGSM), which attempt to maximize a loss function as a proxy to the true adversarial optimization problem. Whereas loss-based adversarial attacks start locally at or near the original image, the LogBarrier attack begins far from the original image. In this sense, the LogBarrier attack is similar in spirit to the Boundary attack [BRB18]: both the LogBarrier attack and the Boundary attack begin with a misclassified image, and iteratively move the image closer to the original image, while maintaining misclassification. The LogBarrier attack is a gradient-based attack: to enforce misclassification, gradients of the logarithmic barrier are required. In contrast, the Boundary attack is gradient-free, and uses rejection sampling to enforce misclassification. Although the LogBarrier attack uses gradients, we have shown that it is not impeded by gradient obfuscation, a common drawback of other gradient-based attacks. Because the LogBarrier attack is able to use gradients, it is typically faster than the Boundary attack.

The LogBarrier attack may be used as an effective tool to validate claims of adversarial robustness. We have shown that one strength of the LogBarrier attack is its ability to attack all images in a test set, using a fairly small maximum adversarial distance compared to other attacks. In other words, the LogBarrier attack estimates the mean adversarial distance with high precision. Using the LogBarrier attack, we have raised questions about the robustness of multi-step adversarial training [MMS17]. For instance, on MNIST, we showed that multi-step adversarial training did not significantly improve the necessary distance required to perturb all test images, relative to an undefended model. For adversarially trained models on CIFAR10, we showed that the necessary distance to perturb all images is significantly smaller than the estimate provided by IFGSM. This is further motivation for the development of rigorous, theoretical guarantees of model robustness.

References