Lipschitz regularization of deep neural networks
Adversarial training is an effective method for improving robustness to adversarial attacks. We show that adversarial training using the Fast Signed Gradient Method can be interpreted as a form of regularization. We implemented a more effective form of adversarial training, which in turn can be interpreted as regularization of the loss in the 2-norm, ∇_x ℓ(x)_2. We obtained further improvements to adversarial robustness, as well as provable robustness guarantees, by augmenting adversarial training with Lipschitz regularization.READ FULL TEXT VIEW PDF
Lipschitz regularization of deep neural networks
Adversarial training is an effective method for improving robustness to adversarial attacks. We show that adversarial training using the Fast Signed Gradient Method (Goodfellow et al., 2014) can be interpreted as regularization by the average of the 1-norm of the gradient of the loss over the data,
The choice of norm for the adversarial perturbation can lead to different interpretations: using the 2-norm for adversarial training corresponds to
We consider Lipschitz regularization in §3. Write for the Lipschitz constant of loss of the model, . We found existing methods of Lipschitz regularization based on norms of weight matrices (Bartlett, 1996; Szegedy et al., 2013) to be ineffective. As an alternative, we consider a tractable Lipschitz regularization of the loss of the model, by taking the maximum over the data of the norm of the gradient of the loss of the model.
Moreover, we show in 3.2 that controls the adversarial robustness of the model. Thus we interpret adversarial training (in the 2-norm) augmented with Lipschitz regularization as minimization of the objective function
which we refer to as (tulip). In practice, outperforms and . For example on CIFAR-10, for a ResNeXt model, adversarial training alone reduced adversarial training error by 29% (measured at adversarial distance111Apologies for overloading ‘’ for both the loss and for norms: we hope the meaning is clear from context ) over an undefended model. In contrast, with Lipschitz regularization () reduces adversarial error by 42% over baseline. See Table 1
. We trained with hyperparametersand . Other values of and may work better; we did not tune these hyperparameters. See §4 for empirical results.
Improving robustness to adversarial samples is a first step towards model verification (Szegedy et al., 2013; Goodfellow et al., 2018). However robustness guarantees to adversarial samples are difficult to obtain, since in practice it is only possible to generate suboptimal adversarial attacks.
Adversarial samples are unlikely to occur randomly. Rather, they are generated by an adversary. Adversarial attacks are classified according to the amount of information available to the attacker. White box attacks occur when the attacker has full access to the loss, model, and label. Typically white box attacks are generated using loss gradients: these attacks include L-BFGS (Szegedy et al. (2013)), Fast Signed Gradient (Goodfellow et al. (2014)), Jacobian Saliency (Papernot et al. (2016a)), and Projected Gradient Descent (Madry et al. (2017)). Black box attacks rely on less information, using model outputs rather than model gradients. Black box attacks require more effort (Papernot et al. (2017)) to implement, but their brute force approach may make them more effective evading adversarial defences (Brendel et al. (2018)).
The recent review Goodfellow et al. (2018) discusses defences against adversarial attacks and their limitations. The earliest and most successful defense is adversarial training (Szegedy et al. (2013); Goodfellow et al. (2014); Tramèr et al. (2018); Madry et al. (2017)). Top entries in a recent adversarial defence competition (Kurakin et al. (2017)) used Ensemble Adversarial Training (Tramèr et al. (2018)), where a model is adversarially trained with inputs generated by an ensemble of other models.
In adversarial training, the model, , is trained to solve the minimax problem
However in practice this problem is not computationally feasible. Instead, (1) is approximated. A popular and effective approximation is the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which also defines an attack.
Other forms of defences against gradient based attacks besides adversarial training include (Papernot et al., 2017, 2016b) as well as adding stochastic noise to the model, using a non-differentiable classifier (Lu et al. (2017)
), or defense distillation (Hinton et al. (2015); Papernot et al. (2016c)). Gradient based methods may be less successful against black box attacks (Brendel et al. (2018)).
Other possible defences discussed in Goodfellow et al. (2018) include input validation and preprocessing, which would potentially allow adversarial samples to be recognized before being input to the model, and architecture modifications designed to improve robustness to adversarial samples. For more information we refer to the review (Goodfellow et al. (2018)) and the discussion of attack methods in (Brendel et al. (2018)).
A form of robustness guarantees for a network is provided by the global Lipschitz constant of the model. Weng et al. (2018) show that the Lipschitz constant of the model gives an certifiable minimum adversarial distance: a successful attack on image will have adversarial distance at least
where is the Lipschitz constant of the model, , and is the correct label of . Thus training models to have small Lipschitz constant could improve adversarial robustness (Hein & Andriushchenko (2017); Tsuzuku et al. (2018)). Oberman & Calder (2018)
recently showed that Lipschitz regularization leads to a proof of generalization. The Lipschitz constant of a model may be estimated using only the product of the norms of model weight matrices (Bartlett (1996); Szegedy et al. (2013)), which is independent of the data. Models have been trained using this estimate as a regularization term in (Cissé et al., 2017; Gouk et al., 2018; Miyato et al., 2018a; Tsuzuku et al., 2018).
Write for the correct label and for the classifier. An adversarial attack , is a perturbation of the input which leads to incorrect classification
Adversarial attacks seek to find the minimum norm attack vector, which is an intractable problem(Athalye et al., 2018). An alternative which permits loss gradients to be used, is to consider the attack vector of a given norm which most increases the loss, .
The solution of (3) can be approximated using the dual norm (Boyd & Vandenberghe, 2004, A.1.6). If the -norm is used, we recover the Signed Gradient (Goodfellow et al., 2014). However a different attack vector is obtained if we measure attacks in the 2-norm.
The optimal attack vector defined by (3) in a generic norm can be approximated to with the vector , where is the solution of
and is the dual norm. In particular is given by
Write and use the Taylor expansion of
Then we can approximate (3) by solving
In the case of the -norm, the dual norm is the 1-norm, and the solution is given by the Signed Gradient vector . In the case of the 2-norm the dual norm is itself the 2-norm and the solution of (6) is given by . ∎
The 2-norm attack vector, points in the direction of the gradient of the loss, while the signed gradient attack vector points in the direction of the optimal dual vector.
Adversarial training can be interpreted as minimizing
Left: Comparison of attack methods using error curves for undefended ResNeXt-34, on the CIFAR-10 test set. The data stability curve is black. A higher curve means more probability of error, so theprojected gradient method is the most effective attack. Right: Comparison of Iterative FGSM and gradient descent on a quadratic function in two dimensions. The search direction chosen by FGSM is not the direction of steepest ascent.
max test statistics
|Dataset||defense method||median||% Err at||median||% Err at|
The angle between and is given by
where is the input dimension. Because , this ratio is always between zero and one. On the networks we studied, the ratio above could be as small as 0.32. To illustrate, Figure 0(b) shows the angle between iterative FGSM and the iterative gradient ascent on a toy loss (convex quadratic) in two dimensions. In practice we find iterative attacks using the steepest ascent direction are more effective than iterative FGSM based attacks, see Section 4.2.
The Lipschitz constant of a function is given by
When is differentiable on a closed, bounded domain, , then
Here for vector value functions, , the induced matrix norm must be used, based on the norms for and (Horn et al., 1990, Chapter 5.6.4). The result is standard in analysis, it follows from the Mean Value Theorem and the definition of the derivative. Using (10), we can approximate the Lipschitz constant by testing on the data
Because the loss is a scalar, Lipschitz regularization of the loss is implemented by taking and minimizing the regularized loss function
The first term in () is the expected loss, and the second term is the approximation of the Lipschitz constant of the loss coming from (11
). During training with Stochastic Gradient Descent, both terms are evaluated over mini-batches.
Define the Lipschitz constant of the data (in the norm) to be
Table 4.1 lists the Lipschitz constant of the training data for common datasets, which are all small: all but one are below 1 in the norm.
The Lipschitz extension theorem (Valentine, 1945) says that given function values , there exists an extension which perfectly fits the data, and has the same Lipschitz constant, provided the appropriate norm are used on the and spaces. This can be done using, for example, the 2-norm for and the norm on the label space. In other norms, we can also make an extension, but the Lipschitz constant may increase (Johnson & Lindenstrauss, 1984). Of course, such a function may not be consistent with a given architecture.
The following Lemma shows that the Lipschitz constant of the loss function gives a robustness guarantee for the loss incurred by an adversarial perturbation of norm . An analogous formula gives the corresponding robustness result using the Lipschitz constant of the model (2).
Suppose the composed loss function is -Lipschitz continuous. Let be an adversarial perturbation of norm . Then
By Lipschitz continuity of
There are two cases for the left-hand side, depending on the sign. In both cases we obtain (13). ∎
If the goal is adversarial robustness, then regularization of the loss is just as effective (empirically) as regularizing the model, at a much lower cost. Since the loss is a scalar, regularizing by the Lipschitz constant of the loss is equivalent to corresponds to regularization of the model
in one direction. By the chain rule,
For example, when is the KL divergence, and when then
Thus, in this case, regularizing corresponds to regularization of in the direction .
The estimate (11) is a lower bound of the Lipschitz constant of the loss. It is well known that data independent upper bounds on the Lipschitz constant of the model are available (Bartlett, 1996) using the product of the norm of the weight matrices. See also Szegedy et al. (2013); Cissé et al. (2017); Gouk et al. (2018); Miyato et al. (2018b, a); Tsuzuku et al. (2018). Other estimates are also available, for example Weng et al. (2018) used Extreme value theory to estimate the local Lipschitz constant of a model.
Let be the weight matrix of the -th layer of a network comprised of layers, and suppose all non linearities of a network are at most 1-Lipschitz. Then via the chain rule and properties of induced matrix norms, it can be shown
with and . Certain conditions on the ’s must be met. For a proof with see Tsuzuku et al. (2018). A similar bound is available with and . However, for deep models, we found that this bound was much too large: on the models we considered, this estimate was on the order of to .
We considered two toy problems, using image classification on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton (2009)). We tested our methods on three networks, chosen to represent a broad range of architectures: AllCNN-C (Springenberg et al. (2014)), a 34 layer ResNet (He et al. (2016)), and a 34 layer ResNeXt (Xie et al. (2017)). Training and model details are provided in Appendix A.
We define the error curve of a model given an attack. The curve provides information about the robustness of a model to attacks of different norms.
The error curve of the model for the attack is the probability over the test data that an attack of size leads to a misclassification
See Figure 0(a) for error curves for a given model over a range of attacks. We also plot the data stability curve, which is the probability that a perturbation can move one data point in the direction of another data point with a different label (which can be interpreted as a very weak attack).
We can compare how two different models perform against various attacks using the model error curve. In Table 1 we report and , using the 2-norm on . These values corresponded to the test error, and noise which is slightly smaller than a human perceptible perturbation (see Figure 3). We also report the median distance which corresponds to the -intercept of 50% error on the curve.
We attacked each model on the test/validation set using six untargeted attack methods: gradient attack; projected gradient descent (constrained in ); the Fast Gradient Sign Method (FGSM) (Goodfellow et al. (2014)); Iterative FGSM (I-FGSM) (Kurakin et al. (2016)); DeepFool (Moosavi-Dezfooli et al. (2016)); and Boundary attack (Brendel et al., 2018). The first five methods are white-box attacks, while the last is a black-box attack. I-FGSM and the projected gradient attack are iterative methods, whereas FGSM and the gradient attack are single step. All attacks were implemented with Foolbox 1.3.1 (Rauber et al. (2017)). Hyperparameters were set to Foolbox defaults, except for the Boundary attack222 The Boundary attack is a computationally demanding attack, and so due to resource constraints we ran the Boundary attack for only 500 iterations per test image. The boundary attack should have better performance with more iterations.. For each image and attack method, each attack reports an adversarial distance (in ).
On each model, dataset, and regularization method, we tested all six attack methods on the entire test/validation set. We compared attack methods using the attack error curve. For example see Figure 0(a), where we plot attack error curves for each attack method on an undefended model. The attack error curve plots the percent of misclassified test images as a function of adversarial distance. We report against Euclidean distance, , because it is an often reported measure of adversarial robustness, although other choices (MSE, distance in -norm) are equally valid. Common adversarial metrics are easily read off the attack error curve. For example, median Euclidean adversarial distance occurs at the -intercept of 50% error. The percent error given a maximum adversarial distance (for example ) is also readily available.
On all models and for all defence methods studied, projected gradient descent (constrained in ) consistently outperformed the other attack methods: Projected gradient descent had the smallest mean adversarial distance and the highest attack error curve. See Figure 0(a) for an illustration. A close second was I-FGSM, the other iterative attack method tested. The next two best attack methods were gradient attack, followed by FGSM. The Boundary attack outperformed DeepFool. We observed the same ranking of attacks on all models and defences studied. For this reason in the following section, we only report model statistics using projected gradient descent, constrained in , which could also be regarded as the strongest attack of all the attacks listed.
Each model was trained with combinations of up to three adversarial defences. The methods were: (i) , the baseline undefended model; (ii) , adversarial training with FGSM; (iii) , adversarial training with 2-norm; each of can be augmented with Lipschitz regularization, which in the last case we call ; we also considered adding a final sigmoid layer to the network, prior to the .
The choice of sigmoid we choose is , and is inspired by (but not equivalent to) -estimators used in classical statistics as a robust estimator (Hampel et al., 2011, Chapter 2)
-estimators have been successfully used to normalize scores and improve robustness, for example in machine learning biometrics (Jain et al. (2005)). See Appendix A.1 for layer details.
Model robustness is evaluated on the entire test/validation set using the median adversarial distance (in ), and the percent misclassified at adversarial distance . We chose because at this magnitude attacks are still imperceptible to the human eye. We argue it is reasonable to ask that models classify images with imperceptible perturbations correctly. At attacks are perceptible, albeit only slightly. See Figure 3. We also plot the attack error curve for each model. These statistics were generated with the projected gradient attack. Table 1 and Figure 2 present results for ResNeXt-34. The best statistics are in bold.
Here we summarize our results for ResNeXt-34, the model studied with the greatest capacity, and defer results for the other models to Appendix B. Without adversarial perturbations, all ResNeXt-34 models achieve roughly 4% test error on CIFAR-10. However, the undefended (baseline, ) model achieves 54% test error at adversarial distance . Adversarial training via FGSM () reduces test error to 24.6%, whereas adversarial training () reduces test error to 13.5%. A combination of all defenses ( with ) further reduces test error to 12.1%. The models are ranked in the same order when instead measured with median adversarial distance. The model with all defenses has median adversarial distance six times that of the undefended model. FGSM () only doubles the median adversarial distance relative to the baseline undefended model. Figure 1(a) illustrates that this ranking of defenses holds over all distances of adversarial perturbations.
We observe a similar ranking on CIFAR-100. See for Figure 1(b). Unperturbed, all models achieve between 21% and 22% test error. Without adversarial defenses, ResNeXt-34 (4x32d) has a test error of 74% at adversarial distance . Adversarial training alone brings the test error down to 56.3% and 53.7%, with respectively FGSM and adversarial training. A combination of all defenses further reduces test error to 42.6%. Median adversarial distance increased from 0.05 on the undefended model to 0.14 on the model with all defenses.
In Table 1 we also report statistics measuring the model’s Lipschitz constant. The columns and give the maximum of these norms over the test/validation set. The norm of the product of weights is independent of test data, and is an upper bound on the global Lipschitz constant of the model. Employing all defenses dramatically decreases the norm of the model Jacobian on the test data, and hence improves model robustness. On CIFAR-10 the model with all defenses has Jacobian norm nearly 10 times smaller than the undefended model, whereas adversarial training only improves the Jacobian norm by a factor of three at most. On CIFAR-100, adversarial training alone does not appear to improve the norm of the Jacobian significantly. However a combination of all defenses decreases the norm of the model Jacobian by a factor of two.
In Appendix B we report results for all models and combinations of defense methods. Of the individual defenses by themselves, adversarial training ( or ) improves model robustness the most. We find adversarial training () to be more effective than FGSM (). We observe the same ranking of defense methods for AllCNN and ResNet-34. Adversarial training improves model robustness. However model robustness is further improved by adding Lipschitz regularization, which empirically decreases the Jacobian norm of the model on the test data.
Both adversarial training and Lipschitz regularization increase training time by a factor of no more than four. In contrast, adding a final layer to normalize the logits is nearly free, and consistently improves model robustness by itself.
Rather than using as the Lipschitz penalty, we also tried training models with direct estimates of the Lipschitz constant. We tried both the product of layer weight norms , and the tighter estimate . However, neither of these direct estimates were effective as regularizers. The gap between the empirical Lipschitz constant on the data (the modulus of continuity on the data), and the estimated Lipschitz constant is too large. See for example Table 1, where we report the maximum Jacobian norm and . These two statistics differ by at least four orders of magnitude. The estimate is worse, and is numerically infeasible for models with more than a few layers. For example, on the two 34-layer networks we studied, this estimate was at least , and was as large as . Another estimate of the local Lipschitz constant is available using a statistic from Extreme value theory (Weng et al. (2018)). However this estimate requires at a minimum many tens of model evaluations for each image, and so is not tractable as a Lipschitz estimate during training.
The authors thank Bill Tubbs, Alex Iannantuono and Aram Pooladian for their assistance designing the experimental pipeline. The authors acknowledge the support of a Google gift which was used to support Bilal Abbasi during a collaboration at Google Brain Montréal. Adam Oberman was partially supported by AFOSR grant FA9550-18-1-0167.
We used standard data augmentation for the CIFAR dataset, comprising of horizontal flips, and random crops of padded images, four pixels per side. We used square cutout (Devries & Taylor (2017)
) of width 16 on CIFAR-10, and width 8 on CIFAR-100, but no dropout. Batch normalization was used after every convolution layer. We used SGD with an initial learning rate of 0.1, momentum set to 0.9, and a batch size of 128. CIFAR-10 was trained for 200 epochs, dropping the learning rate by a factor of five after epochs 60, 120, and 180. On CIFAR-100, networks were trained for 300 epochs, and the learning rate was dropped by a factor of 10 after epochs 150 and 225. For CIFAR-10 weight decay (Tikhonov/regularization) was set to ; on CIFAR-100 it was .
For networks with Lipschitz regularization, the Lagrange multiplier of the excess Lipschitz term was set to . Adversarially trained models were trained with images perturbed to an distance of . We did not tune either of these hyperparameters.
For CIFAR-10, the ResNeXt architecture we used had a depth of 34 layers, cardinality 2 and width 32, with a basic residual block rather than a bottleneck. The branches (convolution groups) of the blocks were aggregated via a mean, rather than using a fully connected layer. For CIFAR-100 the architecture was the same, but had cardinality 4.
Prior to the final
layer, we found inserting a sigmoid activation function improved model robustness. In this case, the sigmoid layer comprised of first batch normalization (without learnable parameters), followed by the activation function, where is a single learnable parameter, common across all layer inputs.
Here we present complete results for all regularization types, on all models and datasets considered. Because adversarial training outperforms FGSM, we only report results for the former.
|Model||defense method||% Err at||median||% Err at||median||% Err at|
|Model||defense method||% Err at||median||% Err at||median||% Err at|