A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally covers "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass by 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at https://github.com/locuslab/projected_sinkhorn.
A substantial effort in machine learning research has gone towards studying adversarial examples (Szegedy et al., 2014), commonly described as datapoints that are indistinguishable from “normal” examples, but are specifically perturbed to be misclassified by machine learning systems. This notion of indistinguishability, later described as the threat model for attackers, was originally taken to be $\ell_\infty$ bounded perturbations, which model a small amount of noise injected to each pixel (Goodfellow et al., 2015). Since then, subsequent work on understanding, attacking, and defending against adversarial examples has largely focused on this threat model and its corresponding $\ell_p$ generalization. While the $\ell_p$ ball is a convenient source of adversarial perturbations, it is by no means a comprehensive description of all possible adversarial perturbations. Other work (Engstrom et al., 2017) has looked at perturbations such as rotations and translations, but beyond these specific transforms, there has been little work considering broad classes of attacks beyond the $\ell_p$ ball.
In this paper, we propose a new type of adversarial perturbation that encodes a general class of attacks that is fundamentally different from the $\ell_p$ ball. Specifically, we propose an attack model where the perturbed examples are bounded in Wasserstein distance from the original example. This distance can be intuitively understood for images as the cost of moving around pixel mass to move from one image to another. Note that the Wasserstein ball and the $\ell_p$ ball can be quite different in their allowable perturbations: examples that are close in Wasserstein distance can be quite far in $\ell_p$ distance, and vice versa (a pedagogical example demonstrating this is in Figure 1).
We develop this idea of Wasserstein adversarial examples in two main ways. Since adversarial examples are typically best generated using variants of projected gradient descent, we first derive an algorithm that projects onto the Wasserstein ball. However, performing an exact projection is computationally expensive, so our main contribution here is to derive a fast method for approximate projection. The procedure can be viewed as a modified Sinkhorn iteration, but with a more complex set of update equations. Second, we develop efficient methods for adversarial training under this threat model. Because this involves repeatedly running the projection within an inner optimization loop, speedups that use a local transport plan (i.e. only moving pixel mass to nearby pixels) are particularly crucial, making the projection complexity linear in the image size.
We evaluate the attack quality on standard models, showing for example that we can reduce the adversarial accuracy of a standard CIFAR10 classifier from 94.7% to 3% using a Wasserstein ball of radius 0.1 (equivalent to moving 10% of the mass of the image by one pixel), whereas the same attack reduces the adversarial accuracy of a model certifiably trained against $\ell_\infty$ perturbations from 66% to 61%. In contrast, we show that with adversarial training, we are able to improve the adversarial accuracy of this classifier to 76% while retaining a nominal accuracy of 80.7%. We additionally show, however, that existing certified defenses cannot be easily extended to this setting; building models provably robust to Wasserstein attacks will require fundamentally new techniques. In total, we believe this work highlights a new direction in adversarial examples: convex perturbation regions which capture a much more intuitive form of structure in their threat model, and which move towards a more “natural” notion of adversarial attacks.
Much of the work in adversarial examples has focused on the original $\ell_\infty$ threat model presented by Goodfellow et al. (2015), some of which also extends naturally to $\ell_p$ perturbations. Since then, there has been a plethora of papers studying this threat model, ranging from improved attacks to heuristic and certified defenses and verifiers. As there are far too many to discuss here, we highlight a few which are the most relevant to this work.
The most commonly used method for generating adversarial examples is to use a form of projected gradient descent over the region of allowable perturbations, originally referred to as the Basic Iterative Method (Kurakin et al., 2017). Since then, there has been a back-and-forth of new heuristic defenses followed by more sophisticated attacks. To name a few, distillation was proposed as a defense but was defeated (Papernot et al., 2016; Carlini & Wagner, 2017), realistic transformations seen by vehicles were thought to be safe until more robust adversarial examples were created (Lu et al., 2017; Athalye et al., 2018b), and many defenses submitted to ICLR 2018 were broken before the review period even finished (Athalye et al., 2018a). One undefeated heuristic defense is to use the adversarial examples in adversarial training, which has so far worked well in practice (Madry et al., 2018). While this method has traditionally been used for $\ell_\infty$ and $\ell_2$ balls (and has a natural $\ell_p$ generalization), in principle, the method can be used to project onto any kind of perturbation region.
Another set of related papers are verifiers and provable defenses, which aim to produce (or train on) certificates that are provable guarantees of robustness against adversarial attacks. Verification methods are now applicable to multi-layer neural networks using techniques ranging from semi-definite programming relaxations (Raghunathan et al., 2018) and mixed integer linear programming (Tjeng et al., 2019) to duality (Dvijotham et al., 2018). Provable defenses are able to tie verification into training non-trivial deep networks by backpropagating through certificates, which are generated with duality-based bounds (Wong & Kolter, 2018; Wong et al., 2018), abstract interpretations (Mirman et al., 2018), and interval bound propagation (Gowal et al., 2018). These methods have subsequently inspired new heuristic training defenses, where the resulting models can be independently verified as robust (Croce et al., 2018; Xiao et al., 2019). Notably, some of these approaches are not overly reliant on specific types of perturbations (e.g. duality-based bounds). Despite their generality, these certificates have only been trained and evaluated in the context of $\ell_\infty$ and $\ell_2$ balls, and we believe this is due in large part to a lack of alternatives.
Highly relevant to this work are attacks that lie outside the traditional $\ell_p$ ball of imperceptible noise. For example, simple rotations and translations form a fairly limited set of perturbations that can be quite large in $\ell_p$ norm, but are sometimes sufficient in order to fool classifiers (Engstrom et al., 2017). On the other hand, adversarial examples that work in the real world do not necessarily conform to the notion of being “imperceptible”, and need to utilize a stronger adversary that is visible to real world systems. Some examples include wearing adversarial 3D printed glasses to fool facial recognition (Sharif et al., 2017), the use of adversarial graffiti to attack traffic sign classification (Eykholt et al., 2018), and printing adversarial textures on objects to attack image classifiers (Athalye et al., 2018b). While Sharif et al. (2017) allows perturbations that are physical glasses, the others use an $\ell_p$ threat model with a larger radius, when a different threat model could be a more natural description of adversarial examples that are perceptible on camera.
Last but not least, our paper relies heavily on the Wasserstein distance, which has seen applications throughout machine learning. The traditional notion of Wasserstein distance has the drawback of being computationally expensive: computing a single distance involves solving an optimal transport problem (a linear program) with a number of variables quadratic in the dimension of the inputs. However, it was shown that by subtracting an entropy regularization term, one can compute approximate Wasserstein distances extremely quickly using the Sinkhorn iteration (Cuturi, 2013), which was later shown to run in near-linear time (Altschuler et al., 2017). Relevant but orthogonal to our work is that of Sinha et al. (2018) on achieving distributional robustness using the Wasserstein distance. While we both use the Wasserstein distance in the context of adversarial training, the approach is quite different: Sinha et al. (2018) use the Wasserstein distance to perturb the underlying data distribution, whereas we use the Wasserstein distance as an attack model for perturbing each example.
This paper takes a step back from using $\ell_p$ as a perturbation metric, and proposes using the Wasserstein distance instead as an equivalently general but qualitatively different way of generating adversarial examples. To tackle the computational complexity of projecting onto a Wasserstein ball, we use ideas from the Sinkhorn iteration (Cuturi, 2013) to derive a fast method for an approximate projection. Specifically, we show that subtracting a similar entropy-regularization term from the projection problem results in a Sinkhorn-like algorithm, and using local transport plans makes the procedure tractable for generating adversarial images. In contrast to $\ell_\infty$ and $\ell_2$ perturbations, we find that the Wasserstein metric generates adversarial examples whose perturbations have inherent structure reflecting the actual image itself (see Figure 2 for a comparison). We demonstrate the efficacy of this attack on standard models, models trained against this attack, and provably robust models (against $\ell_\infty$ attacks) on MNIST and CIFAR10 datasets. While the last of these models are not trained to be robust specifically against this attack, we observe that some (but not all) robustness empirically transfers over to protection against the Wasserstein attack. More importantly, we show that while the Wasserstein ball does fit naturally into duality based frameworks for generating and training against certificates, there is a fundamental roadblock preventing these methods from generating non-vacuous bounds on Wasserstein balls.
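The most common method of creating adversarial examples is to use a variation of projected gradient descent. Specifically, let $(x, y)$ be a datapoint and its label, and let $\mathcal{B}(x, \epsilon)$ be some ball around $x$ with radius $\epsilon$, which represents the threat model for the adversary. We first define the projection operator onto $\mathcal{B}(x, \epsilon)$ to be
$$\operatorname{proj}_{\mathcal{B}(x,\epsilon)}(w) = \arg\min_{z \in \mathcal{B}(x,\epsilon)} \|w - z\|_2^2 \tag{1}$$
which finds the point closest (in Euclidean space) to the input $w$ that lies within the ball $\mathcal{B}(x, \epsilon)$. Then, for some step size $\alpha$ and some loss $\ell$ (e.g. cross-entropy loss), the algorithm consists of the following iteration:
$$x^{(t+1)} = \operatorname{proj}_{\mathcal{B}(x,\epsilon)}\left(x^{(t)} + \arg\max_{\|v\| \le \alpha} v^T \nabla \ell(x^{(t)}, y)\right) \tag{2}$$
where $x^{(0)} = x$ or any randomly initialized point within $\mathcal{B}(x, \epsilon)$. This is sometimes referred to as projected steepest descent, which is used to generate adversarial examples since the standard gradient steps are typically too small. If we consider the $\ell_\infty$ ball and use steepest descent with respect to the $\ell_\infty$ norm, then we recover the Basic Iterative Method originally presented by Kurakin et al. (2017).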
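As a minimal sketch of this iteration for the $\ell_\infty$ case, the snippet below uses NumPy with a toy linear loss. The function names and the loss are hypothetical and chosen only for illustration; for the $\ell_\infty$ norm, the steepest descent direction is the sign of the gradient and the projection is a coordinate-wise clip.

```python
import numpy as np

def project_linf(w, x, eps):
    """Euclidean projection of w onto the l_inf ball of radius eps around x."""
    return np.clip(w, x - eps, x + eps)

def pgd_linf(grad_fn, x, eps, alpha, steps):
    """Projected steepest descent w.r.t. the l_inf norm, ascending the loss."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        # argmax over ||v||_inf <= alpha of v^T g is alpha * sign(g)
        x_adv = project_linf(x_adv + alpha * np.sign(g), x, eps)
    return x_adv

# Toy loss l(x) = theta^T x (hypothetical), whose gradient is the constant theta.
theta = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = pgd_linf(lambda z: theta, x, eps=0.1, alpha=0.03, steps=10)
```

With a real classifier, `grad_fn` would return the gradient of the loss with respect to the input, and swapping `project_linf` for another projection changes the threat model without touching the loop.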
One of the heuristic defenses that works well in practice is to use adversarial training with a PGD adversary. Specifically, instead of minimizing the loss evaluated at an example $x$, we minimize the loss on an adversarially perturbed example $x_{adv}$, where $x_{adv}$ is obtained by running the projected gradient descent attack for the ball $\mathcal{B}(x, \epsilon)$ for some number of iterations, as shown in Algorithm 1. Taking $\mathcal{B}(x, \epsilon)$ to be an $\ell_\infty$ ball recovers the procedure used by Madry et al. (2018).
Finally, we define the most crucial component of this work, an alternative metric to $\ell_p$ distances. The Wasserstein distance (also referred to as the Earth mover’s distance) is an optimal transport problem that can be intuitively understood in the context of distributions as the minimum cost of moving probability mass to change one distribution into another. When applied to images, this can be interpreted as the cost of moving pixel mass from one pixel to another, where the cost increases with distance.
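More specifically, let $x, y \in \mathbb{R}^n_+$ be two non-negative data points such that $\sum_i x_i = \sum_j y_j = 1$ (so images and other inputs need to be normalized), and let $C \in \mathbb{R}^{n \times n}_+$ be some non-negative cost matrix where $C_{ij}$ encodes the cost of moving mass from $x_i$ to $y_j$. Then, the Wasserstein distance between $x$ and $y$ is defined to be
$$d_{\mathcal{W}}(x, y) = \min_{\Pi \in \mathbb{R}^{n \times n}_+} \langle \Pi, C \rangle \quad \text{subject to} \quad \Pi 1 = x, \; \Pi^T 1 = y \tag{3}$$
where the minimization is over transport plans $\Pi$, whose entries $\Pi_{ij}$ encode how the mass moves from $x_i$ to $y_j$. Then, we can define the Wasserstein ball with radius $\epsilon$ as
$$\mathcal{B}_{\mathcal{W}}(x, \epsilon) = \{x + \Delta : d_{\mathcal{W}}(x, x + \Delta) \le \epsilon\} \tag{4}$$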
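To make the definition concrete, here is a small sketch for the special case of one-dimensional histograms with cost $|i - j|$, where the optimal transport cost is known to equal the sum of absolute differences between the two cumulative distributions. The helper names and the toy histograms are hypothetical; general cost matrices require solving the linear program instead.

```python
def wasserstein_1d(x, y):
    """1-Wasserstein distance between two 1-D histograms with cost |i - j|.

    Uses the closed form for one dimension: the sum of absolute differences
    between the two cumulative distributions.
    """
    assert abs(sum(x) - 1.0) < 1e-9 and abs(sum(y) - 1.0) < 1e-9
    total, cdf_gap = 0.0, 0.0
    for xi, yi in zip(x, y):
        cdf_gap += xi - yi
        total += abs(cdf_gap)
    return total

def linf(x, y):
    """l_inf distance between two histograms."""
    return max(abs(a - b) for a, b in zip(x, y))

x = [0.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.9, 0.1, 0.0]  # 10% of the mass moved by one pixel
far = [0.0, 0.9, 0.0, 0.1]      # 10% of the mass moved by two pixels

d_shift = wasserstein_1d(x, shifted)
d_far = wasserstein_1d(x, far)
```

Both perturbed histograms are at the same $\ell_\infty$ distance from `x`, but the second is twice as far in Wasserstein distance because the mass travels twice as far.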
The crux of this work relies on offering a fundamentally different type of adversarial example from typical $\ell_p$ perturbations: the Wasserstein adversarial example.
In order to generate Wasserstein adversarial examples, we can run the projected gradient descent attack from Equation (2), dropping in the Wasserstein ball $\mathcal{B}_{\mathcal{W}}$ from Equation (4) in place of $\mathcal{B}$. However, while projections onto regions such as $\ell_\infty$ and $\ell_2$ balls are straightforward and have closed form computations, simply computing the Wasserstein distance itself requires solving an optimization problem. Thus, the first natural requirement to generating Wasserstein adversarial examples is to derive an efficient way to project examples onto a Wasserstein ball of radius $\epsilon$. Specifically, projecting $w$ onto the Wasserstein ball around $x$ with radius $\epsilon$ and transport cost matrix $C$ can be written as solving the following optimization problem:
$$\min_{z \in \mathbb{R}^n_+,\, \Pi \in \mathbb{R}^{n \times n}_+} \frac{1}{2}\|w - z\|_2^2 \quad \text{subject to} \quad \Pi 1 = x, \; \Pi^T 1 = z, \; \langle \Pi, C \rangle \le \epsilon \tag{5}$$
While we could directly solve this optimization problem (using an off-the-shelf quadratic programming solver), this is prohibitively expensive to do for every iteration of projected gradient descent, especially since there is a quadratic number of variables. However, Cuturi (2013) showed that the standard Wasserstein distance problem from Equation (3) can be approximately solved efficiently by subtracting an entropy regularization term on the transport plan $\Pi$, and using the Sinkhorn-Knopp matrix scaling algorithm. Motivated by these results, instead of solving the projection problem in Equation (5) exactly, the key contribution that allows us to do the projection efficiently is to instead solve the following entropy-regularized projection problem:
$$\min_{z \in \mathbb{R}^n_+,\, \Pi \in \mathbb{R}^{n \times n}_+} \frac{1}{2}\|w - z\|_2^2 + \frac{1}{\lambda}\sum_{ij} \Pi_{ij} \log \Pi_{ij} \quad \text{subject to} \quad \Pi 1 = x, \; \Pi^T 1 = z, \; \langle \Pi, C \rangle \le \epsilon \tag{6}$$
Although this is an approximate projection onto the Wasserstein ball, importantly, the looseness in the approximation is only in finding the projection which is closest (in $\ell_2$ norm) to the original example. All feasible points, including the optimal solution, are still within the actual $\epsilon$-Wasserstein ball, so examples generated using the approximate projection are still within the Wasserstein threat model.
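For reference, the standard Sinkhorn matrix scaling iteration of Cuturi (2013) for the entropy-regularized Wasserstein distance, which the projection builds on, can be sketched as follows. This is an illustrative NumPy version under the assumption that the kernel is $K = \exp(-\lambda C)$ and that both marginals are strictly positive; the function name, the toy histograms, and the parameter values are hypothetical.

```python
import numpy as np

def sinkhorn(x, y, C, lam=10.0, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn matrix scaling.

    Returns the transport plan P = diag(u) K diag(v) and its transport cost.
    Assumes x and y are strictly positive and sum to one.
    """
    K = np.exp(-lam * C)
    u = np.ones_like(x)
    v = np.ones_like(y)
    for _ in range(iters):
        u = x / (K @ v)    # rescale rows so that they sum to x
        v = y / (K.T @ u)  # rescale columns so that they sum to y
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * C))

idx = np.arange(4)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)  # cost |i - j|
x = np.array([0.5, 0.3, 0.1, 0.1])
y = np.array([0.1, 0.1, 0.3, 0.5])
P, cost = sinkhorn(x, y, C)
```

Each iteration uses only two matrix-vector products, which is the baseline against which the projected variant is compared.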
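The dual of the entropy-regularized Wasserstein projection problem in Equation (6) is
$$\max_{\alpha, \beta \in \mathbb{R}^n,\, \psi \in \mathbb{R}_+} g(\alpha, \beta, \psi), \qquad g(\alpha, \beta, \psi) = -\frac{1}{2\lambda}\|\beta\|_2^2 - \psi\epsilon + \alpha^T x + \beta^T w - \sum_{ij} \exp(\alpha_i)\exp(-\psi C_{ij} - 1)\exp(\beta_j) \tag{7}$$
Note that the dual problem here differs from the traditional dual problem for Sinkhorn iterates by having an additional quadratic term on $\beta$ and an additional dual variable $\psi$. Nonetheless, we can still derive a Sinkhorn-like algorithm by performing block coordinate ascent over the dual variables (the full derivation can be found in Appendix A.3). Specifically, maximizing $g$ with respect to $\alpha$ results in
$$\arg\max_{\alpha_i} g(\alpha, \beta, \psi) = \log x_i - \log\left(\sum_j \exp(-\psi C_{ij} - 1)\exp(\beta_j)\right) \tag{8}$$
which is identical (up to a log transformation of variables) to the original Sinkhorn iterate proposed in Cuturi (2013). The maximization step for $\beta$ can also be done analytically with
$$\arg\max_{\beta_j} g(\alpha, \beta, \psi) = \lambda w_j - W\left(\lambda \exp(\lambda w_j) \sum_i \exp(\alpha_i)\exp(-\psi C_{ij} - 1)\right) \tag{9}$$
where $W$ is the Lambert $W$ function, which is defined as the inverse of $f(x) = x e^x$. Finally, since $\psi$ cannot be solved for analytically, we can perform the following Newton step
$$\psi' = \psi - t \cdot \frac{\partial g / \partial \psi}{\partial^2 g / \partial \psi^2} \tag{10}$$
where
$$\frac{\partial g}{\partial \psi} = -\epsilon + \sum_{ij} \exp(\alpha_i) C_{ij} \exp(-\psi C_{ij} - 1)\exp(\beta_j), \qquad \frac{\partial^2 g}{\partial \psi^2} = -\sum_{ij} \exp(\alpha_i) C_{ij}^2 \exp(-\psi C_{ij} - 1)\exp(\beta_j) \tag{11}$$
and where $t$ is small enough such that $\psi' \ge 0$. Once we have solved the dual problem, we can recover the primal solution (to get the actual projection), which is described in Lemma 2 and proved in Appendix A.2.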
The whole algorithm can then be vectorized and implemented as Algorithm 2, which we call projected Sinkhorn iterates. The algorithm uses a simple line search to ensure that the constraint $\psi \ge 0$ is not violated. Each iteration has 8 operations (matrix-vector product or matrix-matrix element-wise product), in comparison to the original Sinkhorn iteration which has 2 matrix-vector products.
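The block coordinate ascent described above can be sketched in NumPy as follows. This is an illustrative, unoptimized version and not the authors' Algorithm 2: it assumes the dual updates take the form $\alpha_i = \log x_i - \log \sum_j K_{ij} e^{\beta_j}$ and $\beta_j = \lambda w_j - W(\lambda e^{\lambda w_j} \sum_i e^{\alpha_i} K_{ij})$ with $K = \exp(-\psi C - 1)$, assumes $x$ is strictly positive, implements the Lambert $W$ function with a hand-written Newton iteration, and adds a final feasibility safeguard that is not described in the text. All names and parameter values are hypothetical, and a large $\lambda$ may need a log-domain implementation that this sketch omits.

```python
import numpy as np

def lambert_w(a, iters=50):
    """Principal branch of the Lambert W function for a >= 0 (Newton's method)."""
    u = np.log1p(a)  # starts at or above the root, so Newton decreases monotonically
    for _ in range(iters):
        eu = np.exp(u)
        u = u - (u * eu - a) / (eu * (u + 1.0))
    return u

def projected_sinkhorn(x, w, C, eps, lam, iters=300):
    """Approximate projection of w onto the Wasserstein ball of radius eps around x."""
    n = x.size
    beta = np.full(n, np.log(1.0 / n))
    psi = 1.0

    def kernel(psi):
        return np.exp(-psi * C - 1.0)

    def alpha_step(beta, psi):
        # Rescales the rows of the plan so that they sum to x.
        return np.log(x) - np.log(kernel(psi) @ np.exp(beta))

    def plan(alpha, beta, psi):
        return np.exp(alpha)[:, None] * kernel(psi) * np.exp(beta)[None, :]

    for _ in range(iters):
        alpha = alpha_step(beta, psi)
        col = kernel(psi).T @ np.exp(alpha)
        beta = lam * w - lambert_w(lam * np.exp(lam * w) * col)
        Pi = plan(alpha, beta, psi)
        grad = -eps + np.sum(Pi * C)   # first derivative of the dual in psi
        hess = -np.sum(Pi * C * C)     # second derivative of the dual in psi
        if hess != 0.0:
            step, t = grad / hess, 1.0
            while psi - t * step < 0.0 and t > 1e-12:
                t *= 0.5               # simple line search keeping psi >= 0
            psi = max(psi - t * step, 0.0)

    alpha = alpha_step(beta, psi)
    Pi = plan(alpha, beta, psi)
    # Safeguard (an addition of this sketch): raise psi until the cost budget holds.
    while np.sum(Pi * C) > eps:
        psi = 1.05 * psi + 1e-3
        alpha = alpha_step(beta, psi)
        Pi = plan(alpha, beta, psi)
    return Pi.sum(axis=0), Pi

idx = np.arange(5)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)
x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
w = np.array([0.3, 0.2, 0.2, 0.2, 0.1])  # point to project, e.g. after a gradient step
z, Pi = projected_sinkhorn(x, w, C, eps=0.1, lam=50.0)
```

Returning the column sums of the plan as `z` guarantees that the output is the marginal of a plan with cost at most `eps`, so it lies inside the Wasserstein ball even if the dual variables have not fully converged.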
The original Sinkhorn iteration has a natural interpretation as a matrix scaling algorithm, iteratively rescaling the rows and columns of a matrix to achieve the target distributions. The Projected Sinkhorn iteration has a similar interpretation: while the $\alpha$ step rescales the rows of $\exp(\alpha)\exp(-\psi C - 1)\exp(\beta)$ to sum to $x$, the $\beta$ step rescales the columns of $\exp(\alpha)\exp(-\psi C - 1)\exp(\beta)$ to sum to $-\beta/\lambda + w$, which is the primal transformation of the projected variable $z$ at optimality as described in Lemma 2. Lastly, the $\psi$ step can be interpreted as correcting for the transport cost of the current scaling: the numerator of the Newton step is simply the difference between the transport cost of the current matrix scaling and the maximum constraint $\epsilon$. A full derivation of the algorithm and a more detailed explanation on this interpretation can be found in Appendix A.3.
The quadratic runtime dependence on input dimension can grow quickly, and this is especially true for images. Rather than allowing transport plans to move mass to and from any pair of pixels, we instead restrict the transport plan to move mass only within a $k \times k$ region of the originating pixel, similar in spirit to a convolutional filter. As a result, the cost matrix only needs to define the cost within a $k \times k$ region, and we can utilize tools used for convolutional filters to efficiently apply the cost to each region. This reduces the computational complexity of each iteration to $O(nk^2)$. For images with more than one channel, we can use the same transport plan for each channel and only allow transport within a channel, so the cost matrix remains $k \times k$. For local transport plans on CIFAR10, the projected Sinkhorn iterates typically converge in around 30-40 iterations, taking about 0.02 seconds per iteration on a Titan X for minibatches of size 100. Note that if we use a cost matrix that reflects the 1-Wasserstein distance, then this problem could be solved even more efficiently using Kantorovich duality; however, we use this formulation to enable more general $p$-Wasserstein distances, or even non-standard cost matrices.
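A quick back-of-the-envelope count illustrates the saving from local transport plans. The window size $k = 5$ below is a hypothetical value picked for the example, and the function name is illustrative.

```python
def plan_entries(height, width, k=None):
    """Number of transport plan entries for a single-channel image.

    A full plan couples every pair of pixels (n * n entries); a local plan only
    couples each pixel with a k x k window around it (n * k * k entries).
    """
    n = height * width
    return n * n if k is None else n * k * k

full = plan_entries(32, 32)        # full plan for a 32 x 32 image
local = plan_entries(32, 32, k=5)  # local plan with a hypothetical 5 x 5 window
ratio = full / local
```

For a 32 x 32 image this is 1,048,576 entries for the full plan against 25,600 for the local one, a reduction of roughly 41 times, and the gap widens linearly with the number of pixels.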
With local transport plans, the method is fast enough to be used within a projected gradient descent routine to generate adversarial examples on images, and further used for adversarial training as in Algorithm 1 (using steepest descent with respect to the $\ell_\infty$ norm), except that we do an approximate projection onto the Wasserstein ball using Algorithm 2.
In this section, we run the Wasserstein examples through a range of typical experiments in the literature of adversarial examples. Table 1 summarizes the nominal accuracies obtained by all considered models. All experiments can be run on a single GPU, and all code for the experiments is available at https://github.com/locuslab/projected_sinkhorn.
| Data set | Model | Nominal Accuracy |
For MNIST we used the convolutional ReLU architecture used in Wong & Kolter (2018), with two convolutional layers with 16 and 32 filters each, followed by a fully connected layer with 100 units, which achieves a nominal accuracy of 98.89%. For CIFAR10 we focused on the standard ResNet18 architecture (He et al., 2016), which achieves a nominal accuracy of 94.76%.
For all experiments in this section, we focused on using local transport plans for the Wasserstein ball, and used an entropy regularization constant of 1000 for MNIST and 3000 for CIFAR10. The cost matrix used for transporting between pixels is taken to be the 2-norm of the distance in pixel space (e.g. the cost of going from pixel $(i, j)$ to $(k, l)$ is $\sqrt{|i - k|^2 + |j - l|^2}$), which makes the optimal transport cost a metric more formally known as the 1-Wasserstein distance. For more extensive experiments on using different sizes of transport plans, different regularization constants, and different cost matrices, we direct the reader to Appendix C.
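The cost described here can be written down as a small kernel of distances from a centre pixel to every offset in its window. The sketch below is illustrative only: the function name and the window size of 5 are hypothetical, and only the 2-norm cost itself comes from the text.

```python
import math

def local_cost_kernel(k):
    """k x k matrix of Euclidean distances from the centre pixel to each offset."""
    assert k % 2 == 1, "the window needs a centre pixel"
    r = k // 2
    return [[math.hypot(di, dj) for dj in range(-r, r + 1)]
            for di in range(-r, r + 1)]

kernel = local_cost_kernel(5)  # hypothetical 5 x 5 window
```

The centre entry is zero (leaving mass in place is free), horizontal and vertical neighbours cost 1, and diagonal neighbours cost $\sqrt{2}$; the same kernel can be reused at every pixel, much like a convolutional filter.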
We use the following evaluation procedure to attack models with projected gradient descent on the Wasserstein ball. For each MNIST example, we start from an initial radius $\epsilon$ and increase it by a factor of 1.1 every 10 iterations until either an adversarial example is found or until 200 iterations have passed, which bounds the maximum perturbation radius. For CIFAR10, we start from an initial radius $\epsilon$ and increase it by a factor of 1.17 until either an adversarial example is found or until 400 iterations have passed, which again bounds the maximum perturbation radius.
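The growth of the radius under such a schedule is simple to tabulate. The sketch below assumes the radius is multiplied by the factor once every `every` iterations, starting from iteration 0; that indexing convention, the function name, and the placeholder initial radius of 1.0 are all assumptions made for illustration.

```python
def radius_schedule(eps0, factor, every, max_iters):
    """Radius used at each attack iteration when eps grows geometrically."""
    return [eps0 * factor ** (t // every) for t in range(max_iters)]

# MNIST-style schedule: factor 1.1 every 10 iterations for 200 iterations.
# eps0 = 1.0 is a placeholder, so the values are multiples of the initial radius.
mnist = radius_schedule(eps0=1.0, factor=1.1, every=10, max_iters=200)
growth = mnist[-1] / mnist[0]
```

Under this convention the final radius is the initial one multiplied by 1.1 raised to the 19th power, a little over six times larger; an attack that stops early simply reports the radius at which the first adversarial example was found.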
For MNIST, we consider a standard model, a model with binarization, a model provably robust to $\ell_\infty$ perturbations of bounded radius, and an adversarially trained model. We provide a visual comparison of the Wasserstein adversarial examples generated on each of the four models in Figure 3. The susceptibility of all four models to the Wasserstein attack is plotted in Figure 4.
For MNIST, despite restricting the transport plan to local regions, a standard model is easily attacked by Wasserstein adversarial examples. In Figure 4, we see that Wasserstein attacks can successfully attack a typical MNIST classifier 50% of the time at a fixed radius $\epsilon$, a rate which goes up to 94% as $\epsilon$ grows. A Wasserstein radius of $\epsilon = 0.5$ can be intuitively understood as moving 50% of the pixel mass over by 1 pixel, or alternatively moving less than 50% of the pixel mass more than 1 pixel. Furthermore, while preprocessing images with binarization is often seen as a way to trivialize adversarial examples on MNIST, we find that it performs only marginally better than the standard model against Wasserstein perturbations.
We also run the attack on the model trained by Wong et al. (2018), which is guaranteed to be provably robust against $\ell_\infty$ perturbations of bounded radius. While not specifically trained against Wasserstein perturbations, in Figure 4 we find that it is substantially more robust than either the standard or the binarized model, requiring a significantly larger $\epsilon$ to have the same attack success rate.
Finally, we apply this attack as an inner procedure within an adversarial training framework for MNIST. To save on computation, during training we adopt a weaker adversary and use only 50 iterations of projected gradient descent. We also let $\epsilon$ grow within a range and train on the first adversarial example found (essentially a budget version of the attack used at test time). Specific details regarding this schedule and also the learning parameters used can be found in Appendix B.1. We find that the adversarially trained model is empirically the most well defended against this attack of all four models, and cannot be attacked down to 0% accuracy (Figure 4).
For CIFAR10, we consider a standard model, a model provably robust to $\ell_\infty$ perturbations of bounded radius, and an adversarially trained model. We plot the susceptibility of each model to the Wasserstein attack in Figure 5.
We find that for a standard ResNet18 CIFAR10 classifier, a small perturbation radius $\epsilon$ is already enough to misclassify 25% of the examples, while a radius of $\epsilon = 0.1$ is enough to fool the classifier 97% of the time (Figure 5). Despite $\epsilon$ being so small, we see in Figure 6 that the structure of the perturbations still reflects the actual content of the images, though certain classes require larger magnitudes of change than others.
We further empirically evaluate the attack on a model that was trained to be provably robust against $\ell_\infty$ perturbations. We use the model weights from Wong et al. (2018), which are trained to be provably robust against $\ell_\infty$ perturbations of bounded radius. Further note that this CIFAR10 model actually is a smaller ResNet than the ResNet18 architecture considered in this paper, and consists of 4 residual blocks with 16, 16, 32, and 64 filters. Nonetheless, we find that while the model suffers from poor nominal accuracy (achieving only 66% accuracy on unperturbed examples as noted in Table 1), the robustness against $\ell_\infty$ attacks remarkably seems to transfer quite well to robustness against Wasserstein attacks in the CIFAR10 setting, achieving 61% adversarial accuracy for $\epsilon = 0.1$ in comparison to 3% for the standard model.
To perform adversarial training for CIFAR10, we use a similar scheme to that used for MNIST: we adopt a weaker adversary that uses only 50 iterations of projected gradient descent during training and allow $\epsilon$ to grow within a range (specific details can be found in Appendix B.2). We find that adversarial training here is also able to defend against this attack, and at the same threshold of $\epsilon = 0.1$, we find that the adversarial accuracy has been improved from 3% to 76%.
Lastly, we present some analysis on how this attack fits into the context of provable defenses, along with a negative result demonstrating a fundamental gap that needs to be solved. The Wasserstein attack can be naturally incorporated into duality based defenses: Wong et al. (2018) show that to use their certificates to defend against other inputs, one only needs to solve the following optimization problem:
$$\max_{z \in \mathcal{B}(x, \epsilon)} -z^T y$$
for some constant vector $y$ and for some perturbation region $\mathcal{B}(x, \epsilon)$ (a similar approach can be taken to adapt the dual verification from Dvijotham et al. (2018)). For the Wasserstein ball, this is highly similar to the problem of projecting onto the Wasserstein ball from Equation (6), with a linear objective instead of a quadratic objective and fewer variables. In fact, a Sinkhorn-like algorithm can be derived to solve this problem, which ends up being a simplified version of Algorithm 2 (this is shown in Appendix D).
However, there is a fundamental obstacle towards generating provable certificates against Wasserstein attacks: these defenses (and many other, non-duality based approaches) depend heavily on propagating interval bounds from the input space through the network, in order to efficiently bound the output of ReLU units. This concept is inherently at odds with the notion of Wasserstein distance: a “small” Wasserstein ball can use a low-cost transport plan to move all the mass at a single pixel to its neighbors, or vice versa. As a result, when converting a Wasserstein ball to interval constraints, the interval bounds immediately become vacuous: each individual pixel can attain its minimum or maximum value under some $\epsilon$-cost transport plan. In order to guarantee robustness against Wasserstein adversarial attacks, significant progress must be made to overcome this limitation.
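A small numerical sketch makes the problem tangible. It computes the tightest per-pixel interval implied by a Wasserstein ball for a one-dimensional strip of pixels with cost $|i - j|$: a pixel is lowered by pushing its mass to an adjacent pixel at unit cost, and raised by greedily pulling mass from the nearest pixels first. The one-dimensional setting, the uniform toy "image", and the function name are assumptions made for illustration.

```python
def interval_bounds_1d(x, eps):
    """Per-pixel bounds implied by a Wasserstein ball (cost |i - j|) around x."""
    n = len(x)
    lower, upper = [], []
    for i in range(n):
        # Cheapest way to lower pixel i: push its mass to a neighbour at unit cost.
        lower.append(max(0.0, x[i] - eps))
        # Cheapest way to raise pixel i: pull mass from the nearest pixels first.
        budget, gained = eps, 0.0
        for j in sorted(range(n), key=lambda j: abs(i - j)):
            if j == i or budget <= 0.0:
                continue
            moved = min(x[j], budget / abs(i - j))
            gained += moved
            budget -= moved * abs(i - j)
        upper.append(x[i] + gained)
    return lower, upper

n = 28 * 28
x = [1.0 / n] * n  # uniform normalized strip with as many pixels as an MNIST image
lower, upper = interval_bounds_1d(x, eps=0.1)
```

With a radius of 0.1, every pixel can be emptied entirely and every pixel can grow to more than ten times its original mass, so the resulting box constraints say almost nothing about the input.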
In this paper, we have presented a new, general threat model for adversarial examples based on the Wasserstein distance, a metric that captures a kind of perturbation that is fundamentally different from traditional $\ell_p$ perturbations. To generate these examples, we derived an algorithm for fast, approximate projection onto the Wasserstein ball that can use local transport plans for even more speedup on images. We successfully attacked standard networks, showing that these adversarial examples are structurally perturbed according to the content of the image, and demonstrated the empirical effectiveness of adversarial training. Finally, we observed that networks trained to be provably robust against $\ell_\infty$ attacks are more robust than the standard networks against Wasserstein attacks; however, we show that the current state of provable defenses is insufficient to directly apply to the Wasserstein ball due to their reliance on interval bounds.
We believe overcoming this roadblock is crucial to the development of verifiers or provable defenses against not just the Wasserstein attack, but also to improve the robustness of classifiers against other attacks that do not naturally convert to interval bounds (e.g. attacks bounded in other $\ell_p$ norms). Whether we can develop efficient verification or provable training methods that do not rely on interval bounds remains an open question.
Perhaps the most natural future direction for this work is to begin to understand the properties of Wasserstein adversarial examples and what we can do to mitigate them, even if only at a heuristic level. However, at the end of the day, the Wasserstein threat model defines just one example of a convex region capturing structure that is different from $\ell_p$ balls. By no means have we characterized all reasonable adversarial perturbations, and so a significant gap remains in determining how to rigorously define general classes of adversarial examples that can characterize phenomena different from the $\ell_p$ and Wasserstein balls.
Finally, although we focused primarily on adversarial examples in this work, the method of projecting onto Wasserstein balls may be applicable outside of deep learning. Projection operators play a major role in optimization algorithms beyond projected gradient descent (e.g. ADMM and alternating projections). Perhaps even more generally, the techniques in this paper could be used to derive Sinkhorn-like algorithms for classes of problems that consider Wasserstein constrained variables.
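For convenience, we multiply the objective by $\lambda$ and solve this problem instead:
$$\min_{z \in \mathbb{R}^n_+,\, \Pi \in \mathbb{R}^{n \times n}_+} \frac{\lambda}{2}\|w - z\|_2^2 + \sum_{ij} \Pi_{ij} \log \Pi_{ij} \quad \text{subject to} \quad \Pi 1 = x, \; \Pi^T 1 = z, \; \langle \Pi, C \rangle \le \epsilon$$
Introducing dual variables $(\alpha, \beta, \psi)$ where $\psi \ge 0$, the Lagrangian is
$$L(z, \Pi, \alpha, \beta, \psi) = \frac{\lambda}{2}\|w - z\|_2^2 + \sum_{ij} \Pi_{ij} \log \Pi_{ij} + \psi\left(\langle \Pi, C \rangle - \epsilon\right) + \alpha^T(x - \Pi 1) + \beta^T(z - \Pi^T 1)$$
The KKT optimality conditions are now
$$\frac{\partial L}{\partial \Pi_{ij}} = \psi C_{ij} + 1 + \log \Pi_{ij} - \alpha_i - \beta_j = 0, \qquad \frac{\partial L}{\partial z_j} = \lambda(z_j - w_j) + \beta_j = 0$$
so at optimality, we must have
$$\Pi_{ij} = \exp(\alpha_i)\exp(-\psi C_{ij} - 1)\exp(\beta_j), \qquad z = -\frac{\beta}{\lambda} + w$$
Plugging in the optimality conditions, we get
$$L(z^*, \Pi^*, \alpha, \beta, \psi) = -\frac{1}{2\lambda}\|\beta\|_2^2 - \psi\epsilon + \alpha^T x + \beta^T w - \sum_{ij} \exp(\alpha_i)\exp(-\psi C_{ij} - 1)\exp(\beta_j) = g(\alpha, \beta, \psi)$$
so the dual problem is to maximize $g$ over $\alpha, \beta, \psi \ge 0$. ∎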
These equations follow directly from the KKT optimality conditions from Equation (18). ∎
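To derive the algorithm, note that since this is a strictly convex problem, to get the $\alpha$ and $\beta$ iterates we solve for setting the gradient to 0. The derivative with respect to $\alpha$ is
$$\frac{\partial g}{\partial \alpha_i} = x_i - \exp(\alpha_i)\sum_j \exp(-\psi C_{ij} - 1)\exp(\beta_j)$$
and so setting this to 0 and solving for $\alpha_i$ gives the $\alpha$ iterate. The derivative with respect to $\beta$ is
$$\frac{\partial g}{\partial \beta_j} = -\frac{\beta_j}{\lambda} + w_j - \exp(\beta_j)\sum_i \exp(\alpha_i)\exp(-\psi C_{ij} - 1)$$
and setting this to 0 and solving for $\beta_j$ gives the $\beta$ iterate (this step can be done using a symbolic solver; we used Mathematica). Lastly, the $\psi$ updates are straightforward scalar calculations of the derivative and second derivative.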
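The $\beta$ step can also be checked by hand. The short derivation below is a sketch that assumes the dual objective has the form $g(\alpha, \beta, \psi) = -\frac{1}{2\lambda}\|\beta\|_2^2 - \psi\epsilon + \alpha^T x + \beta^T w - \sum_{ij} \exp(\alpha_i)\exp(-\psi C_{ij} - 1)\exp(\beta_j)$; the shorthand $S_j$ and the substitution variable $u$ are introduced here only for illustration.

```latex
\begin{align*}
0 &= \frac{\partial g}{\partial \beta_j}
   = -\frac{\beta_j}{\lambda} + w_j - S_j e^{\beta_j},
   \qquad S_j = \sum_i \exp(\alpha_i)\exp(-\psi C_{ij} - 1) \\
\text{let } u &= \lambda w_j - \beta_j : \quad
   \frac{u}{\lambda} = S_j e^{\lambda w_j} e^{-u}
   \;\Longrightarrow\; u e^{u} = \lambda e^{\lambda w_j} S_j \\
u &= W\!\left(\lambda e^{\lambda w_j} S_j\right)
   \;\Longrightarrow\;
   \beta_j = \lambda w_j - W\!\left(\lambda e^{\lambda w_j} S_j\right)
\end{align*}
```

The substitution turns the stationarity condition into the defining equation $u e^u = a$ of the Lambert $W$ function, which is why that function appears in the closed-form update.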
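Recall the transformation of dual to primal variables from Lemma 2. To see how the Projected Sinkhorn iteration is a (modified) matrix scaling algorithm, we can interpret these quantities before optimality as primal iterates. Namely, at each iteration $t$, let
$$\Pi^{(t)}_{ij} = \exp(\alpha^{(t)}_i)\exp(-\psi^{(t)} C_{ij} - 1)\exp(\beta^{(t)}_j), \qquad z^{(t)} = -\frac{\beta^{(t)}}{\lambda} + w$$
Since the $\alpha$ step sets the derivative with respect to $\alpha$ to 0, after an update for $\alpha$ we have that
$$\sum_j \Pi^{(t)}_{ij} = x_i$$
so the $\alpha$ step rescales the transport matrix to sum to $x$. Similarly, after an update for $\beta$, we have that
$$\sum_i \Pi^{(t)}_{ij} = -\frac{\beta^{(t)}_j}{\lambda} + w_j = z^{(t)}_j$$
which is a rescaling of the transport matrix to sum to the projected value. Lastly, the numerator of the $\psi$ step can be rewritten as
$$\frac{\partial g}{\partial \psi} = -\epsilon + \langle \Pi^{(t)}, C \rangle$$
as a simple adjustment based on whether the current transport plan is above or below the maximum threshold $\epsilon$.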
During adversarial training for MNIST, we adopt an adaptive $\epsilon$ scheme to avoid selecting a specific $\epsilon$. Specifically, to find an adversarial example, we start from an initial $\epsilon$ on the first iteration of projected gradient descent, and increase it by a fixed multiplicative factor every 5 iterations. We terminate the projected gradient descent algorithm when either an adversarial example is found, or when 50 iterations have passed, which bounds the range of values that $\epsilon$ can take.
To update the model weights during adversarial training, we use the SGD optimizer with 0.9 momentum and 0.0005 weight decay, and batch sizes of 128. We begin with a learning rate of 0.1 and reduce it to 0.01 after 10 epochs.
We also use an adaptive scheme for adversarial training in CIFAR10. Specifically, we let on the first iteration of projected gradient descent, and increase it by a factor of 1.5 every 5 iterations. Similar to MNIST, we terminate the projected gradient descent algorithm when either an adversarial example is found, or 50 iterations have passed, allowing to take on values in the range .
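The adaptive scheme can be sketched as a loop that enlarges the radius every few iterations and stops early on success. This is a hedged illustration rather than the released implementation: `attack_step` and `is_adversarial` are hypothetical callbacks standing in for one projected gradient step and the misclassification check, and the initial radius is left as a parameter because its value is not given here. The growth factor of 1.5, the period of 5 iterations, and the cap of 50 iterations follow the text.

```python
def adaptive_radius_attack(attack_step, is_adversarial, x, eps_init,
                           factor=1.5, grow_every=5, max_iters=50):
    """Run an attack while growing the radius eps (illustrative sketch).

    attack_step(x_adv, x, eps) -> new x_adv inside the eps-ball around x.
    is_adversarial(x_adv)      -> True once the example fools the model.
    Returns (x_adv, eps, found).
    """
    eps = eps_init
    x_adv = x
    for t in range(max_iters):
        # The first iterations use eps_init; grow eps every `grow_every` steps.
        if t > 0 and t % grow_every == 0:
            eps *= factor
        x_adv = attack_step(x_adv, x, eps)
        if is_adversarial(x_adv):
            return x_adv, eps, True
    return x_adv, eps, False


# Toy one-dimensional stand-in: move right by 1.0, then project onto [x, x + eps].
toy_step = lambda x_adv, x0, eps: min(x_adv + 1.0, x0 + eps)
toy_check = lambda x_adv: x_adv >= 3.0

x_adv, eps_found, found = adaptive_radius_attack(toy_step, toy_check, 0.0, 1.0)
```

With 50 iterations and growth every 5, the radius is enlarged at most 9 times, so the largest radius tried is the initial value times the factor raised to the ninth power.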
Similar to MNIST, to update the model weights, we use the SGD optimizer with 0.9 momentum and 0.0005 weight decay, and batch sizes of 128. The learning rate is also the same as in MNIST, starting at 0.1, and reducing to 0.01 after 10 epochs.
A commonly asked question of models trained to be robust against adversarial examples is “what if the adversary has a perturbation budget of instead of ?” This is referring to a “robustness cliff,” where a model trained against an strong adversary has a sharp drop in robustness when attacked by an adversary with a slightly larger budget. To address this, we advocate for the slightly modified version of typical adversarial training used in this work: rather than picking a fixed and running projected gradient descent, we instead allow for an adversarial to have a range of . To do this, we begin with , and then gradually increase it by a multiplicative factor until either an adversarial example is found or until is reached. While similar ideas have been used before for evaluating model robustness, we specifically advocate for using this schema during adversarial training. This has the advantage of extending robustness of the classifier beyond a single threshold, allowing a model to achieve a potentially higher robustness threshold while not being significantly harmed by “impossible” adversarial examples.
In this section, we explore the space of possible parameters that we treated as fixed in the main paper. While this is not an exhaustive search, we hope to provide some intuition as to why we chose the parameters we did.
We first study the effect of and the cost matrix . First, note that could be any positive value. Furthermore, note that to construct we used the 2-norm which reflects the 1-Wasserstein metric, but in theory we could use any -Wasserstein metric, where the cost of moving from pixel to is . Figure 8 shows the effects of and on both the adversarial example and the radius at which it was found for varying values of and .
We find that it is important to ensure that is large enough; otherwise, the projection of the image is excessively blurred. In addition to qualitative changes, smaller seems to make it harder to find Wasserstein adversarial examples, making the radius go up as gets smaller. In fact, for and almost all of , the blurring is so severe that no adversarial example can be found.
In contrast, we find that increasing for the Wasserstein distance used in the cost matrix seems to make the images more “blocky”. Specifically, as gets higher, more pixels seem to be moved in larger amounts. This seems to counteract the blurring observed for low to some degree. Naturally, the radius also grows since the overall cost of the transport plan has gone up.
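To make the role of the exponent in the cost matrix concrete, here is a minimal sketch that builds a pixel-to-pixel cost matrix for a small image. It assumes the cost is the Euclidean distance between pixel coordinates raised to the power `p`, which is the usual ground cost for a p-Wasserstein distance; the exact expression used in the experiments is not reproduced here, and the function name `cost_matrix` is hypothetical.

```python
import math


def cost_matrix(height, width, p=1):
    """Cost of moving mass between every pair of pixels (illustrative sketch).

    Assumption: cost((i, j) -> (k, l)) = ||(i, j) - (k, l)||_2 ** p.
    Pixels are flattened in row-major order.
    """
    coords = [(i, j) for i in range(height) for j in range(width)]
    return [[math.hypot(i - k, j - l) ** p for (k, l) in coords]
            for (i, j) in coords]


C1 = cost_matrix(3, 3, p=1)  # plain Euclidean distance
C2 = cost_matrix(3, 3, p=2)  # squared Euclidean distance

# Moving mass between opposite corners of the 3x3 grid costs 2*sqrt(2) for
# p = 1 and 8 for p = 2, so a larger p makes every non-trivial move more
# expensive relative to a fixed radius.
corner_cost_p1 = C1[0][8]
corner_cost_p2 = C2[0][8]
```

Under this assumption, raising `p` inflates the cost of longer moves, which is consistent with the observation that the radius at which adversarial examples are found grows with the exponent.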
In this section we explore the effects of differently sized transport plans. In the main paper, we used a local transport plan, but this could easily be something else, e.g. or . We can see a comparison of the robustness of a standard and the robust model against these differently sized transport plans in Figure 7, using . We observe that while transport plans have difficulty attacking the robust MNIST model, all other plan sizes seem to have similar performance.
In this section we show how a Sinkhorn-like algorithm can be derived for provable defenses, and that the resulting algorithm is actually just a simplified version of the Projected Sinkhorn iteration, which we call the Conjugate Sinkhorn iteration (since it solves the conjugate problem).
By subtracting the same entropy term from the conjugate objective from Equation (14), we can get a problem similar to that of projecting onto the Wasserstein ball.
where again we’ve multiplied the objective by for convenience. Following the same framework as before, we introduce dual variables where , to construct the Lagrangian as
Note that since all the terms with are the same, the corresponding KKT optimality condition for also remains the same. The only part that changes is the optimality condition for , which becomes
Plugging the optimality conditions into the Lagrangian, we get the following dual problem:
Finally, if we minimize this with respect to and we get exactly the same update steps as the Projected Sinkhorn iteration. Consequently, the Conjugate Sinkhorn iteration is identical to the Projected Sinkhorn iteration except that we replace the step with the fixed value .