Improving DNN Robustness to Adversarial Attacks using Jacobian Regularization

03/23/2018
by   Daniel Jakubovitz, et al.
Tel Aviv University

Deep neural networks have lately shown tremendous performance in various applications, including vision and speech processing tasks. However, alongside their ability to perform these tasks with high accuracy, it has been shown that they are highly susceptible to adversarial attacks: a small change to the input can cause the network to err with high confidence. This phenomenon exposes an inherent fault in these networks and in their ability to generalize well. For this reason, providing robustness to adversarial attacks is an important challenge in network training, which has led to extensive research. In this work, we suggest a theoretically inspired novel approach to improve the networks' robustness. Our method applies regularization using the Frobenius norm of the Jacobian of the network, as a post-processing step after regular training has finished. We demonstrate empirically that it leads to enhanced robustness with a minimal change in the original network's accuracy.


1 Introduction

Deep neural networks (DNNs) are a widespread machine learning technique which has shown state-of-the-art performance in many domains such as natural language processing, computer vision and speech processing [1]. Alongside their outstanding performance, deep neural networks have recently been shown to be vulnerable to a specific kind of attack, most commonly referred to as an Adversarial Attack. Such attacks cause significant failures in the networks' performance by making only minor changes in the input data that are barely noticeable by a human observer and are not expected to change the prediction [2]. These attacks pose a possible obstacle to mass deployment of systems relying on deep learning in sensitive fields such as security or autonomous driving, and they expose an inherent weakness in their reliability.

In adversarial attacks, very small perturbations are applied to the network's input data, leading it to classify the input erroneously with high confidence. Even though these small changes cause the model to err with high probability, in most cases they are unnoticeable to the human eye. In addition, it has been shown in [3] that such adversarial attacks tend to generalize well across models. This transferability trait only increases the potential susceptibility to attacks, since an attacker does not need to know the structure of the specific attacked network in order to fool it; thus, black-box attacks are highly successful as well. This inherent vulnerability of DNNs is somewhat counter-intuitive, since it exposes a fault in the model's ability to generalize well in very particular cases.

Lately, this phenomenon has been the subject of substantial research focused on effective attack methods, defense methods, and theoretical explanations for this inherent vulnerability of the model. Attack methods aim to alter the network's input data in order to deliberately cause it to fail in its task. Such methods include DeepFool [4], the Fast Gradient Sign Method (FGSM) [2], the Jacobian-based Saliency Map Attack (JSMA) [5], Universal Perturbations [6], Adversarial Transformation Networks [7], and more [8].

Several defense methods have been suggested to increase deep neural networks' robustness to adversarial attacks. Some strategies aim at detecting whether an input image is adversarial or not (e.g., [9, 10, 11, 12, 13, 14]). For example, the authors in [12] suggested detecting adversarial examples using feature squeezing, whereas the authors in [14] proposed detecting adversarial examples based on density estimates and Bayesian uncertainty estimates. Other strategies focus on making the network more robust to perturbed inputs. The latter approach, which is the focus of this work, aims at increasing the network's accuracy on its original task even when it is fed with perturbed data intended to mislead it. Such increased model robustness has been shown to be achievable by several different methods.

These defense methods include, among others: Adversarial Training [2], which adds perturbed inputs along with their correct labels to the training dataset; Defensive Distillation [15], which trains two networks, where the first is a standard classification network and the second is trained to achieve an output similar to the first network in all classes; the Batch Adjusted Network Gradients (BANG) method [16], which balances gradients in the training batch by scaling up those that have lower magnitudes; Parseval Networks [17], which constrain the Lipschitz constant of each hidden layer in a DNN to be smaller than 1; the Ensemble method [18], which takes as the predicted label the label that maximizes the average of the output probabilities of the classifiers in the ensemble; a Robust Optimization Framework [19], which uses an alternating minimization-maximization procedure in which the loss of the network is minimized over perturbed examples that are generated at each parameter update; Virtual Adversarial Training (VAT) [20], which uses a regularization term to promote the smoothness of the model distribution; Input Gradient Regularization [21], which regularizes the gradient of the cross-entropy loss; and Cross-Lipschitz Regularization [22], which regularizes all the combinations of differences of the gradients of a network's output w.r.t. its input. In another recent work [23], the authors suggested an adversarial training procedure that achieves robustness with guarantees on its statistical performance.

In addition to these works, several theoretical explanations for adversarial examples have been suggested. In [2], the authors claim that linear behavior in high-dimensional spaces creates this inherent vulnerability to adversarial examples. In [24], a game theoretical framework is used to study the relationship between attack and defense strategies in recognition systems in the context of adversarial attacks. In [25], the authors examine the transferability of adversarial examples between different models and find that adversarial examples span a contiguous subspace of large dimensionality. The authors also provide an insight into the decision boundaries of DNNs. In [26], the authors claim that first order attacks are universal and suggest the Projected Gradient Descent (PGD) attack which relies on this notion. They also claim that networks require a significantly larger capacity in order to be more robust to adversarial attacks. In another recent work [27], the authors show that the gradient of a network’s objective function grows with the dimension of its input and conclude that the adversarial vulnerability of a network increases with the dimension of its input.

In [28], the authors showed the relationship between a network’s sensitivity to additive adversarial perturbations and the curvature of the classification boundaries. In addition, they propose a method to discriminate between the original input and perturbed inputs. In [29], the link between a network’s robustness to adversarial perturbations and the geometry of the decision boundaries of this network is further developed. Specifically, it is shown that when the decision boundary is positively curved, small universal perturbations are more likely to fool the classifier. However, a direct application of this insight to increase the networks’ robustness to adversarial examples is, to the best of our knowledge, still unclear.

In a recent work [30], a relationship between the norm of the Jacobian of the network and its generalization error has been drawn. The authors have shown that by regularizing the Frobenius norm of the Jacobian matrix of the network's classification function, a lower generalization error is achieved. In [31], the authors show that using the Jacobian matrix computed at the logits (before the softmax operation) instead of at the probabilities (after the softmax operation) yields better generalization results.

Inspired by the work in [30], we take this notion further and show that using Jacobian regularization as post-processing, i.e., applying it in a second phase of additional training after regular training has finished, also increases deep neural networks' robustness to adversarial perturbations. Besides the relationship to the generalization error, we also show that the Frobenius norm of the Jacobian at a given point is related to its distance from the closest adversarial example and to the curvature of the network's decision boundaries. All these connections provide a theoretical justification for using the Jacobian regularization to decrease the vulnerability of a network to adversarial attacks.

We apply the Jacobian regularization as post-processing after the regular training, once the network has stabilized at a high test accuracy, thereby allowing our strategy to be used with existing pre-trained networks to improve their robustness. In addition, the Jacobian regularization requires only a little additional computation, as it adds a single back-propagation step in each training step, as opposed to other, far more computationally demanding methods such as Distillation [15], which requires the training of two networks.

Two techniques closely related to our strategy are the Input Gradient regularization technique proposed in [21] and the Cross-Lipschitz regularization proposed in [22]. Our approach differs from the former in that we regularize the Frobenius norm of the Jacobian matrix of the network itself, and not the norm of the gradient of the cross-entropy loss. It differs from the latter in that we regularize the gradients of the network themselves and not all combinations of their differences, which yields better results at a lower computational cost, as will be shown later.

We compare the methods mentioned above and adversarial training [2] to Jacobian regularization on the MNIST, CIFAR-10 and CIFAR-100 datasets, demonstrating the advantage of our strategy in the form of high robustness to the DeepFool [4], FGSM [2], and JSMA [5] attack methods. Our method surpasses the results of the other strategies on FGSM and DeepFool and achieves competitive performance on JSMA. We also show that using Jacobian regularization combined with adversarial training further improves the robustness results.

This paper is organized as follows. Section 2 introduces the Jacobian regularization method and related strategies. Section 3 shows its connection to some theory of adversarial examples. The relationships drawn in this section suggest that regularizing the Jacobian of deep neural networks can improve their robustness to adversarial examples. In Section 4 we demonstrate empirically the advantages of this approach. Section 5 concludes our paper. In the appendices we provide more theoretical insight and additional experimental results.

2 Jacobian Regularization for Adversarial Robustness

Adversarial perturbations are essentially small changes in the input data which cause large changes in the network’s output. In order to prevent this vulnerability, during the post-processing training phase we penalize large gradients of the classification function with respect to the input data. Thus, we encourage the network’s learned function to be more robust to small changes in the input space. This is achieved by adding a regularization term in the form of the Frobenius norm of the network’s Jacobian matrix evaluated on the input data. The relation between the Frobenius norm and the (spectral) norm of the Jacobian matrix has been shown in [30], and lays the justification for using the Frobenius norm of the network’s Jacobian as a regularization term. We emphasize that we apply this regularization as additional post-processing training which is done after the regular training has finished.

To describe the Jacobian regularization more formally, we use the following notation. Let us denote the network's input as a $D$-dimensional vector and its output as a $K$-dimensional vector, and let us assume the training dataset consists of $N$ training examples $\{\mathbf{x}_i\}_{i=1}^{N}$. We use the index $l$ to specify a certain layer in a network with $L$ layers; $\mathbf{z}^{(l)}$ is the output of the $l$-th layer of the network and $z^{(l)}_k$ is the output of the $k$-th neuron in this layer. In addition, let us denote by $\lambda_J$ the hyper-parameter which controls the weight of our regularization penalty in the loss function. The input to the network is

$$\mathbf{x} = \mathbf{z}^{(0)} \in \mathbb{R}^{D}, \qquad (1)$$

and its output is $f(\mathbf{x}) \in \mathbb{R}^{K}$, where the predicted class for an input $\mathbf{x}$ is $\hat{k}(\mathbf{x}) = \arg\max_{k} f_k(\mathbf{x})$, $k \in \{1, \dots, K\}$. Here $f(\mathbf{x})$ is the network's output after the softmax operation, whereas $\mathbf{z}^{(L)}(\mathbf{x})$ is the output of the last fully connected layer in the network for the input $\mathbf{x}$. The term $J^{(l)}(\mathbf{x})$ is the Jacobian matrix of layer $l$ evaluated at the point $\mathbf{x}$, i.e. $J^{(l)}(\mathbf{x}) = \frac{\partial \mathbf{z}^{(l)}(\mathbf{x})}{\partial \mathbf{x}}$. Correspondingly, $J^{(l)}_k(\mathbf{x})$ is the $k$-th row in the matrix $J^{(l)}(\mathbf{x})$.

A network's Jacobian matrix is given by

$$J(\mathbf{x}) \triangleq J^{(L)}(\mathbf{x}) = \frac{\partial \mathbf{z}^{(L)}(\mathbf{x})}{\partial \mathbf{x}} \in \mathbb{R}^{K \times D}, \qquad (2)$$

where the $k$-th row of $J(\mathbf{x})$ is $\nabla_{\mathbf{x}} z^{(L)}_k(\mathbf{x})$. Accordingly, the Jacobian regularization term for an input sample $\mathbf{x}_i$ is

$$\left\| J(\mathbf{x}_i) \right\|_F = \sqrt{\sum_{k=1}^{K} \sum_{d=1}^{D} \left( \frac{\partial z^{(L)}_k(\mathbf{x}_i)}{\partial x_d} \right)^{2}}. \qquad (3)$$

Combining the regularization term in (3) with a standard cross-entropy loss function on the training data, we get the following loss function for training:

$$\mathrm{Loss} = -\sum_{i=1}^{N} \mathbf{y}_i^{\top} \log\big(f(\mathbf{x}_i)\big) + \lambda_J \sum_{i=1}^{N} \left\| J(\mathbf{x}_i) \right\|_F, \qquad (4)$$

where $\mathbf{y}_i$ is a one-hot vector representing the correct class of the input $\mathbf{x}_i$.
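The paper does not publish reference code; the following is a minimal PyTorch sketch of the penalty in (3) and the loss in (4) (the paper's experiments used TensorFlow). The function names and the default weight `lambda_j=0.1` are our placeholders rather than values from the paper, and some implementations use the squared Frobenius norm instead; either variant fits the description above.

```python
import torch
import torch.nn.functional as F

def jacobian_frobenius(model, x, eps=1e-12):
    """Frobenius norm of d(logits)/d(input), averaged over the batch (cf. Eq. (3)).

    One gradient is computed per output class, so the cost grows
    linearly with the number of classes K, as noted in Section 2.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)                      # pre-softmax outputs, shape (B, K)
    sq_norm = 0.0
    for k in range(logits.shape[1]):
        # Summing over the batch yields per-sample gradients in one call,
        # since samples do not interact inside the network.
        grad_k, = torch.autograd.grad(logits[:, k].sum(), x, create_graph=True)
        sq_norm = sq_norm + grad_k.pow(2).flatten(1).sum(dim=1)
    return (sq_norm + eps).sqrt().mean()

def regularized_loss(model, x, y, lambda_j=0.1):
    """Cross-entropy plus the Jacobian penalty, in the spirit of Eq. (4)."""
    return F.cross_entropy(model(x), y) + lambda_j * jacobian_frobenius(model, x)
```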

The Input Gradient regularization method from [21] uses the following regularization term:

$$\sum_{i=1}^{N} \left\| \nabla_{\mathbf{x}}\, H\big(\mathbf{y}_i, f(\mathbf{x}_i)\big) \right\|_2^{2}, \qquad (5)$$

where $H(\mathbf{y}, f(\mathbf{x}))$ denotes the cross-entropy between the correct label and the network's output.

The Cross-Lipschitz regularization method from [22] uses the following regularization term:

$$\sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \left\| \nabla_{\mathbf{x}} f_j(\mathbf{x}_i) - \nabla_{\mathbf{x}} f_k(\mathbf{x}_i) \right\|_2^{2}. \qquad (6)$$
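For comparison, here are analogous sketches of the two competing penalties in (5) and (6), again in PyTorch with our own function names and applied to the model's raw outputs; they are illustrations of the descriptions above rather than the authors' implementations. The quadratic number of pairwise terms in the Cross-Lipschitz penalty is the source of the extra cost discussed in the complexity paragraph below.

```python
import torch
import torch.nn.functional as F

def input_gradient_penalty(model, x, y):
    """Squared L2 norm of the input-gradient of the cross-entropy loss,
    averaged over the batch (Input Gradient regularization, cf. Eq. (5))."""
    x = x.clone().requires_grad_(True)
    # reduction='sum' keeps each sample's gradient unscaled by the batch size
    ce = F.cross_entropy(model(x), y, reduction='sum')
    grad, = torch.autograd.grad(ce, x, create_graph=True)
    return grad.pow(2).flatten(1).sum(dim=1).mean()

def cross_lipschitz_penalty(model, x):
    """Sum over all class pairs (j, k) of ||grad f_j - grad f_k||_2^2,
    averaged over the batch (Cross-Lipschitz regularization, cf. Eq. (6)).
    Needs K gradients but K^2 pairwise difference norms."""
    x = x.clone().requires_grad_(True)
    out = model(x)                          # shape (B, K)
    grads = [torch.autograd.grad(out[:, k].sum(), x, create_graph=True)[0]
             for k in range(out.shape[1])]
    penalty = 0.0
    for j in range(len(grads)):
        for k in range(len(grads)):
            penalty = penalty + (grads[j] - grads[k]).pow(2).flatten(1).sum(dim=1)
    return penalty.mean()
```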

The adversarial training method [2] adds perturbed inputs along with their correct labels to the training dataset, so that the network learns the correct labels of perturbed inputs during training. This helps the network to achieve a higher accuracy when it is being fed with new perturbed inputs, meaning the network becomes more robust to adversarial examples.

Regarding computational complexity, Jacobian regularization introduces an overhead of one additional back-propagation step in every iteration. This step involves the computation of mixed partial derivatives, as the first derivative is w.r.t. the input and the second is w.r.t. the model parameters. However, one should keep in mind that Jacobian regularization is applied as a post-processing phase, and not throughout the entire training, which is computationally beneficial. Moreover, it is also more efficient than the Cross-Lipschitz regularization technique [22], which requires the computation of the norm of $K^2$ difference terms, as opposed to our method, which only requires the calculation of the norm of $K$ different gradients. This makes Jacobian regularization more scalable for datasets with a large number of classes $K$.
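As a usage illustration, the post-processing phase then amounts to a short fine-tuning loop over an already-trained model. The sketch below assumes the hypothetical `regularized_loss` helper from the first sketch in this section; the optimizer, learning rate, epoch count and `lambda_j` are illustrative placeholders, not the paper's settings.

```python
import torch

def jacobian_post_process(model, train_loader, lambda_j=0.1,
                          epochs=5, lr=1e-4, device='cpu'):
    """Fine-tune a pre-trained model with the Jacobian penalty
    (the post-processing phase described above)."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = regularized_loss(model, x, y, lambda_j)  # from the first sketch
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```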

3 Theoretical Justification

3.1 The Jacobian matrix and adversarial perturbations

In essence, for a network performing a classification task, an adversarial attack (a fooling method) aims at making as small a change as possible that alters the network's decision. In other words, it seeks the smallest perturbation that causes the output function to cross a decision boundary into another class, thus causing a classification error. In general, an attack seeks the closest decision boundary that can be reached by an adversarial perturbation in the input space, which makes the attack the least noticeable and the least prone to being discovered [2].

To gain some intuition for our proposed defense method, we start with a simple informal explanation of the relationship between adversarial perturbations and the Jacobian matrix of a network. Let $\mathbf{x}$ be a given input data sample; let $\tilde{\mathbf{x}}$ be a data sample close to $\mathbf{x}$, from the same class, that was not perturbed by an adversarial attack; and let $\hat{\mathbf{x}}$ be another data sample, which is the result of an adversarial perturbation of $\mathbf{x}$ that keeps it close to $\mathbf{x}$ but with a different predicted label. Therefore, we have that for the $\ell_2$ distance metric in the input and the output of the network

(7)

with a high probability. Therefore,

(8)

Let $\mathrm{line}(\mathbf{x}, \hat{\mathbf{x}})$ be the line in the $D$-dimensional input space connecting $\mathbf{x}$ and $\hat{\mathbf{x}}$. According to the mean value theorem, there exists some $\boldsymbol{\xi} \in \mathrm{line}(\mathbf{x}, \hat{\mathbf{x}})$ such that

(9)
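The displays (7)-(9) are not rendered in this copy. A plausible reconstruction of the argument, under the notation above and with our symbol choices ($\tilde{\mathbf{x}}$ for the nearby clean sample, $\hat{\mathbf{x}}$ for the adversarial one, $\boldsymbol{\xi}$ for the intermediate point), is:

$$\|\mathbf{x} - \tilde{\mathbf{x}}\|_2 \approx \|\mathbf{x} - \hat{\mathbf{x}}\|_2 \quad \text{while} \quad \big\|\mathbf{z}^{(L)}(\mathbf{x}) - \mathbf{z}^{(L)}(\tilde{\mathbf{x}})\big\|_2 \ll \big\|\mathbf{z}^{(L)}(\mathbf{x}) - \mathbf{z}^{(L)}(\hat{\mathbf{x}})\big\|_2,$$

so that

$$\frac{\big\|\mathbf{z}^{(L)}(\mathbf{x}) - \mathbf{z}^{(L)}(\hat{\mathbf{x}})\big\|_2}{\|\mathbf{x} - \hat{\mathbf{x}}\|_2} \gg \frac{\big\|\mathbf{z}^{(L)}(\mathbf{x}) - \mathbf{z}^{(L)}(\tilde{\mathbf{x}})\big\|_2}{\|\mathbf{x} - \tilde{\mathbf{x}}\|_2},$$

and, by the mean value inequality applied along the line between $\mathbf{x}$ and $\hat{\mathbf{x}}$,

$$\big\|\mathbf{z}^{(L)}(\mathbf{x}) - \mathbf{z}^{(L)}(\hat{\mathbf{x}})\big\|_2 \le \|J(\boldsymbol{\xi})\|_2\, \|\mathbf{x} - \hat{\mathbf{x}}\|_2 \le \|J(\boldsymbol{\xi})\|_F\, \|\mathbf{x} - \hat{\mathbf{x}}\|_2.$$

This is our reconstruction of the missing relations, not a verbatim quote of the paper.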

This suggests that a lower Frobenius norm of the network’s Jacobian matrix encourages it to be more robust to small changes in the input space. In other words, the network is encouraged to yield similar outputs for similar inputs.

We empirically examined the average values of the Frobenius norm of the Jacobian matrix of networks trained with various defense methods on the MNIST dataset. The network architecture is described in Section 4. Table 1 presents these values for both the original inputs and those perturbed by DeepFool [4]. For "regular" training with no defense, it can be seen that, as predicted, the aforementioned average norm is significantly larger on perturbed inputs. Interestingly enough, using adversarial training, which does not regularize the Jacobian matrix directly, decreases the average Frobenius norm of the Jacobian matrix evaluated on perturbed inputs (second row of Table 1). Yet, when Jacobian regularization is added, this norm is reduced much more (third and fourth rows of Table 1). Thus, it is expected to improve the robustness of the network even further. Indeed, this behavior is demonstrated in Section 4.

Defense method
No defense
Adversarial Training
Jacobian regularization
Jacobian regularization + Adversarial Training
Table 1: Average Frobenius norm of the Jacobian matrix at the original data and at the data perturbed by DeepFool, for a DNN trained on MNIST with various defense methods.
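The kind of measurement behind Table 1 can be sketched as follows, reusing the hypothetical `jacobian_frobenius` helper from the sketch in Section 2; producing the perturbed inputs requires running an attack such as DeepFool, which is outside the scope of this sketch.

```python
import torch

def average_jacobian_norm(model, data_loader, device='cpu'):
    """Average Frobenius norm of the Jacobian over a dataset
    (the quantity reported in Table 1); pass a loader of clean or
    already-perturbed inputs to compare the two cases."""
    model.to(device).eval()
    total, batches = 0.0, 0
    for x, _ in data_loader:
        # gradients w.r.t. the input are still needed, so no torch.no_grad() here
        total += jacobian_frobenius(model, x.to(device)).item()
        batches += 1
    return total / batches
```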

3.2 Relation to classification decision boundaries

As shown in [4], we may locally treat the decision boundaries as hyper-surfaces in the output space of the network. Let us denote by $\mathcal{T}$ a hyper-plane tangent to such a decision boundary hyper-surface in the input space. Using this notion, the following lemma approximates the distance between an input $\mathbf{x}$ and a perturbed input $\hat{\mathbf{x}}$ classified to be at the boundary of a hyper-surface separating between the class of $\mathbf{x}$, denoted $c(\mathbf{x})$, and another class $k$.

Lemma 1

The first order approximation of the distance between an input $\mathbf{x}$, with class $c(\mathbf{x})$, and a perturbed input $\hat{\mathbf{x}}$ classified to the boundary hyper-surface separating the classes $c(\mathbf{x})$ and $k$, for an $\ell_2$ distance metric, is given by

(10)

This lemma is given in [4]. For completeness, we present a short sketch of the proof in Appendix 0.A. Based on this lemma, the following corollary provides a proxy for the minimal distance that may lead to fooling the network.

Corollary 2

Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$. Then the $\ell_2$ norm of the minimal perturbation necessary to fool the classification function is approximated by

(11)
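Equations (10) and (11) are not rendered in this copy. A plausible form, following DeepFool's first-order analysis [4] and the notation of Section 2, would be

$$d\big(\mathbf{x}; c(\mathbf{x}), k\big) \approx \frac{\big| z^{(L)}_k(\mathbf{x}) - z^{(L)}_{c(\mathbf{x})}(\mathbf{x}) \big|}{\big\| \nabla_{\mathbf{x}} z^{(L)}_k(\mathbf{x}) - \nabla_{\mathbf{x}} z^{(L)}_{c(\mathbf{x})}(\mathbf{x}) \big\|_2}$$

for (10) and, taking the closest boundary,

$$\big\| \boldsymbol{\delta}_{\min}(\mathbf{x}) \big\|_2 \approx \min_{k \neq c(\mathbf{x})} \frac{\big| z^{(L)}_k(\mathbf{x}) - z^{(L)}_{c(\mathbf{x})}(\mathbf{x}) \big|}{\big\| \nabla_{\mathbf{x}} z^{(L)}_k(\mathbf{x}) - \nabla_{\mathbf{x}} z^{(L)}_{c(\mathbf{x})}(\mathbf{x}) \big\|_2}$$

for (11); this is a reconstruction on our part rather than a verbatim quote of the paper.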

To make a direct connection to the Jacobian of the network, we provide the following proposition:

Proposition 3

Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$. Then the first order approximation of the $\ell_2$ norm of the minimal perturbation necessary to fool the classification function is lower bounded by

(12)

The proof of Proposition 3 is given in Appendix 0.B. The numerator in (12) is maximized by the minimization of the cross-entropy term of the loss function, since a DNN aspires to learn the correct output with the largest possible confidence, meaning the largest possible margin in the output space between the correct classification and the other possible classes. The term in the denominator is the Frobenius norm of the Jacobian of the last fully connected layer of the network, which is minimized due to the Jacobian regularization part in the loss function. This is essentially a min-max problem, since we wish to maximize the minimal distance necessary to fool the network. For this reason, applying Jacobian regularization during training increases the minimal distance necessary to fool the DNN, thus providing improved robustness to adversarial perturbations. One should keep in mind that it is important not to deteriorate the network's original test accuracy; this is indeed the case, as shown in Section 4.

An important question is whether the regularization of the Jacobian at earlier layers of the network would yield better robustness to adversarial examples. To this end, we examined imposing the regularization on the $(L-1)$-th and the $(L-2)$-th layers of the network. Both of these cases generally yielded degraded robustness results compared to imposing the regularization on the last layer of the network. Thus, throughout this work we regularize the Jacobian of the whole network. The theoretical details are given in Appendix 0.C and the corresponding experimental results are given in Appendix 0.D.

3.3 Relation to decision boundary curvature

In [29] the authors show the link between a network’s robustness to adversarial perturbations and the geometry of its decision boundaries. The authors show that when the decision boundaries are positively curved the network is fooled by small universal perturbations with a higher probability. Here we show that Jacobian regularization promotes the curvature of the decision boundaries to be less positive, thus reducing the probability of the network being fooled by small universal adversarial perturbations.

Let $H_k(\mathbf{x})$ be the Hessian matrix of the network's classification function for class $k$ at the input point $\mathbf{x}$. As shown in [29], the decision boundary between two classes $k$ and $c(\mathbf{x})$ can be locally referred to as the hyper-surface on which the two class scores are equal. Relying on the work in [32], let us use the approximation $H_k(\mathbf{x}) \approx J_k(\mathbf{x})^{\top} J_k(\mathbf{x})$, where $J_k(\mathbf{x})$ is the $k$-th row in the matrix $J(\mathbf{x})$. The matrix $J_k(\mathbf{x})^{\top} J_k(\mathbf{x})$ is a rank one positive semi-definite matrix. Thus, the curvature of the decision boundary, expressed in terms of the Hessians of the two class functions, can be approximated using the aforementioned approximation by

(13)

Thus, we arrive at the following upper bound for the curvature:

(14)
(15)

where the last inequality stems from the matrix norm inequality. For this reason, the regularization of $\|J(\mathbf{x})\|_F$ promotes a less positive curvature of the decision boundaries in the environment of the input samples. This offers a geometric intuition for the effect of Jacobian regularization on the network's decision boundaries: discouraging a positive curvature makes a universal adversarial perturbation less likely to fool the classifier.

4 Experiments

We tested the performance of Jacobian regularization on the MNIST, CIFAR-10 and CIFAR-100 datasets. The results for CIFAR-100, which are generally consistent with the results for MNIST and CIFAR-10, are given in Appendix 0.E. As mentioned before, we use the training with Jacobian regularization as a post-processing phase to the "regular" training. Using a post-processing training phase is highly beneficial: it has a low additional computational cost, as we add the regularization part after the network has already stabilized at a high test accuracy and not throughout the entire training. It also allows taking an existing network and applying the post-processing training phase to it in order to increase its robustness to adversarial examples. We obtained optimal results this way, whereas we found that applying the Jacobian regularization from the beginning of the training yields a lower final test accuracy.

The improved test accuracy obtained using post-processing training can be explained by the advantage of keeping the original training phase, which allows the network to train solely for the purpose of a high test accuracy. The subsequent post-processing training phase with Jacobian regularization introduces only a small change to the already good test accuracy, as opposed to the case where the regularization is applied from the beginning, which results in a worse test accuracy. Table 2 presents a comparison between post-processing training and "regular" training on MNIST. Similar results are obtained for CIFAR-10 and CIFAR-100.

We examine the performance of our method using three different adversarial attack methods: DeepFool [4], FGSM [2] and JSMA [5]. We also assess the performance of our defense combined with adversarial training, which is shown to be effective in improving the model’s robustness. However, this comes at the cost of generating and training on a substantial amount of additional input samples as is the practice in adversarial training. We found that the amount of perturbed inputs in the training mini-batch has an impact on the overall achieved robustness. An evaluation of this matter appears in Appendix 0.F. The results for adversarial training, shown hereafter, are given for the amount of perturbed inputs that yields the optimal results in each test case. We also compare the results to the Input Gradient regularization technique [21] and the Cross-Lipschitz regularization technique [22].

Defense method Test accuracy
No defense x
Input Gradient regularization, "regular" training 23.43 x
Input Gradient regularization, post-processing training x
Cross-Lipschitz regularization, "regular" training x
Cross-Lipschitz regularization, post-processing training x
Jacobian regularization, "regular" training x
Jacobian regularization, post-processing training x
Table 2: Effect of post-processing training vs. "regular" training on MNIST, using different defense methods

For MNIST we used the network from the official TensorFlow tutorial [33]. The network consists of two convolutional layers, each followed by a max pooling layer. These layers are followed by two fully connected layers. All layers use the ReLU activation function, except for the last layer, which is followed by a softmax operation. Dropout regularization with a 0.5 keep probability is applied to the fully connected layers. The training is done using an Adam optimizer [34] and a mini-batch size of 500 inputs. With this network we obtained a high baseline test accuracy. Training with Jacobian regularization was done with a weight $\lambda_J$ that we found to provide a good balance between the cross-entropy loss and the Jacobian regularization.
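For reference, the sketch below is a rough PyTorch counterpart of the MNIST architecture described above. The specific layer widths follow the cited TensorFlow tutorial [33] and are assumptions on our part; they are at least consistent with the 3136-neuron intermediate layer mentioned in Appendix 0.D.

```python
import torch.nn as nn

class MnistNet(nn.Module):
    """Two conv + max-pool blocks followed by two fully connected layers,
    ReLU activations, and dropout (0.5 keep probability) before the output."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),   # 3136 -> 1024
            nn.Dropout(p=0.5),
            nn.Linear(1024, num_classes),             # logits; softmax in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```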

For CIFAR-10 we used a convolutional neural network consisting of four concatenated sets, where each set consists of two convolutional layers followed by a max pooling layer and dropout with a 0.75 keep probability. After these four sets, two fully connected layers are used. Training was done with an RMSProp optimizer [35] and a mini-batch size of 128 inputs. With this network we obtained a high baseline test accuracy. Training with Jacobian regularization was done with a weight $\lambda_J$ that we found to provide a good balance between the cross-entropy loss and the Jacobian regularization.

The results of an ablation study regarding the influence of variation in the values of $\lambda_J$ for MNIST and CIFAR-10 are given in Appendix 0.G.

4.1 DeepFool evaluation

We start by evaluating the performance of our method compared to the others under the DeepFool attack. The DeepFool attack [4] uses a first order approximation of the network's decision boundaries as hyper-planes. Using this approximation, the method seeks the closest decision boundary to be reached by a change in the input. Since the decision boundaries are not actually linear, this process continues iteratively until the perturbed input changes the network's decision. The robustness metric associated with this attack is $\hat{\rho}_{\mathrm{adv}}$, which represents the average ratio between the $\ell_2$ norm of the minimal perturbation necessary to fool the network for an input $\mathbf{x}$ and the $\ell_2$ norm of $\mathbf{x}$. This attack is optimized for the $\ell_2$ metric.
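The metric itself is not rendered in this copy; as defined in the DeepFool paper [4], it is presumably

$$\hat{\rho}_{\mathrm{adv}} = \frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \frac{\|\hat{\mathbf{r}}(\mathbf{x})\|_2}{\|\mathbf{x}\|_2},$$

where $\hat{\mathbf{r}}(\mathbf{x})$ is the minimal perturbation found by DeepFool for the input $\mathbf{x}$ and $\mathcal{D}$ is the test set; larger values mean that larger relative perturbations are needed to fool the network.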

Table 3 and Table 4 present the robustness measured by $\hat{\rho}_{\mathrm{adv}}$ under a DeepFool attack for MNIST and CIFAR-10 respectively. As the results show, Jacobian regularization provides a much more significant robustness improvement than the other methods: substantially smaller perturbation norms suffice to fool networks that use those defense approaches compared to networks trained using Jacobian regularization. Moreover, combining it with adversarial training further enhances this difference in the results.

Defense method Test accuracy
No defense x
Adversarial Training x
Input Gradient regularization 23.43 x
Input Gradient regularization + Adversarial Training 23.49 x
Cross-Lipschitz regularization 29.03 x
Cross-Lipschitz regularization + Adversarial Training 32.38 x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 3: Robustness to DeepFool attack for MNIST
Defense method Test accuracy
No defense x
Adversarial Training x
Input Gradient regularization x
Input Gradient regularization + Adversarial Training x
Cross-Lipschitz regularization x
Cross-Lipschitz regularization + Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 4: Robustness to DeepFool attack for CIFAR-10

Notice that none of the examined defense methods changes the test accuracy significantly. For MNIST, the Jacobian and Cross-Lipschitz regularizations and adversarial training cause a small accuracy decrease, whereas the Input Gradient regularization technique improves the accuracy. Conversely, for CIFAR-10, the Jacobian and Cross-Lipschitz regularizations and adversarial training yield a better accuracy, whereas the Input Gradient regularization reduces the accuracy.

4.2 FGSM evaluation

The FGSM (Fast Gradient Sign Method) attack [2] was designed to rapidly create adversarial examples that fool the network. The method changes the network's input according to

$$\hat{\mathbf{x}} = \mathbf{x} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}}\, H(\mathbf{y}, f(\mathbf{x}))\big), \qquad (16)$$

where $\epsilon$ represents the magnitude of the attack. This attack is optimized for the $\ell_\infty$ metric.
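A minimal PyTorch sketch of this attack follows (our function name; the clamp assumes inputs normalized to the range [0, 1]):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Fast Gradient Sign Method: a single step of size eps in the
    direction of the sign of the input-gradient of the loss (cf. Eq. (16))."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x + eps * grad.sign()).clamp(0.0, 1.0)
```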

We examined the discussed defense methods' test accuracy under the FGSM attack (test accuracy on the perturbed dataset) for different values of $\epsilon$. Fig. 1 presents the results comparing Jacobian regularization to adversarial training, Input Gradient regularization and Cross-Lipschitz regularization. In all cases, using Jacobian regularization only slightly changes the test accuracy on the original (unperturbed) test set for both MNIST and CIFAR-10 (cf. Tables 3 and 4).

Similarly to the results under the DeepFool attack, the results under the FGSM attack show that the test accuracy with the Jacobian regularization defense is higher than with the Input Gradient and Cross-Lipschitz regularizations or with adversarial training. Moreover, if adversarial training is combined with Jacobian regularization, its advantage over using the other techniques is even more distinct. This leads to the conclusion that the Jacobian regularization method yields a more robust network to the FGSM attack.

(a) MNIST
(b) CIFAR-10
Figure 1: Test accuracy for FGSM attack on MNIST (left) and CIFAR-10 (right) for different values of $\epsilon$

4.3 JSMA evaluation

The JSMA (Jacobian-based Saliency Map Attack) [5] relies on the computation of a saliency map, which outlines the impact of every input pixel on the classification decision. At every iteration, the method picks the most influential pixel to be changed such that the likelihood of the target class is increased. We leave the mathematical details to the original paper. Similarly to FGSM, $\epsilon$ represents the magnitude of the attack. The attack is repeated iteratively, and is optimized for the $\ell_0$ metric.

We examined the defense methods' test accuracy under the JSMA attack (test accuracy on the perturbed dataset) for different values of $\epsilon$. Fig. 2 presents the results for the MNIST and CIFAR-10 datasets. The parameters of the JSMA attack are 80 epochs with a 1 pixel attack for the former, and 200 epochs with a 1 pixel attack for the latter. In all cases, using Jacobian regularization only slightly changes the test accuracy on the original (unperturbed) test set for both MNIST and CIFAR-10.

(a) MNIST
(b) CIFAR-10
Figure 2: Test accuracy for the JSMA (1 pixel) attack on MNIST with 80 epochs (left) and CIFAR-10 with 200 epochs (right) for different values of $\epsilon$

Our method achieves superior results compared to the other three methods on CIFAR-10. On the other hand, on MNIST we obtain an inferior performance compared to the Input Gradient regularization method, though a better performance than Cross-Lipschitz regularization. Thus, we conclude that our defense method is effective under the JSMA attack in some cases and presents competitive performance overall. We believe that the reason behind the failure of our method in the MNIST case can be explained by our theoretical analysis: in the formulation of the Jacobian regularization (based on the Frobenius norm of the Jacobian matrix), the metric that is being minimized is the $\ell_2$ norm, whereas in the JSMA attack the metric targeted by the perturbation is the $\ell_0$ pseudo-norm, as only one pixel is changed in every epoch. We provide more details on this issue in Appendix 0.H.

5 Discussion and Conclusions

This paper introduced the Jacobian regularization method for improving DNNs' robustness to adversarial examples. We provided a theoretical foundation for its usage and showed that it yields a high degree of robustness whilst preserving the network's test accuracy. We demonstrated its effectiveness in reducing the vulnerability to various adversarial attacks (DeepFool, FGSM and JSMA) on the MNIST, CIFAR-10 and CIFAR-100 datasets. Under all three examined attack methods, Jacobian regularization exhibits a large improvement in the network's robustness to adversarial examples, while only slightly changing the network's performance on the original test set. Moreover, in general, Jacobian regularization without adversarial training is better than adversarial training without Jacobian regularization, whereas the combination of the two defense methods provides even better results. Compared to Input Gradient regularization, our proposed approach achieves superior performance under two out of the three attacks and competitive performance on the third (JSMA). Compared to Cross-Lipschitz regularization, our proposed approach achieves superior performance under all three examined attacks.

We believe that our approach, with its theoretical justification, may open the door to other novel strategies for defense against adversarial attacks.

In the current form of regularization of the Jacobian, its norm is evaluated at the input samples. We empirically deduced that the optimal results are obtained by applying the Jacobian regularization on the original input samples, which is also more efficient computationally, and not on perturbed input samples or on points in the input space for which the Frobenius norm of the Jacobian matrix is maximal. A future work may analyze the reasons for that.

Notice that in the Frobenius norm, all the rows of the Jacobian matrix are penalized equally. Another possible future research direction is providing a different weight for each row. This may be achieved by either using a weighted version of the Frobenius norm or by replacing it with other norms such as the spectral one. Note, though, that the latter option is more computationally demanding compared to our proposed approach.

Acknowledgment. This work is partially supported by the ERC-StG SPADE grant and the MAGNET MDM grant.

Appendix

Appendix 0.A Proof Sketch of Lemma 1

Lemma 1 The first order approximation of the distance between an input $\mathbf{x}$, with class $c(\mathbf{x})$, and a perturbed input $\hat{\mathbf{x}}$ classified to the boundary hyper-surface separating the classes $c(\mathbf{x})$ and $k$, for an $\ell_2$ distance metric, is given by

(17)

Proof Sketch. Let $\mathcal{T}$ be a hyper-plane tangent to a decision boundary hyper-surface separating two classes $c(\mathbf{x})$ and $k$ in the input space, and let the point of tangency be $\mathbf{x}_t$, a point lying on this boundary hyper-surface. The distance between a point $\mathbf{x}$ and the hyper-plane $\mathcal{T}$ is given by

(18)

Since $\mathbf{x}_t$ is on the boundary hyper-surface between the classes $c(\mathbf{x})$ and $k$, it holds that

(19)

For a point $\mathbf{x}$ which is in the environment of $\mathbf{x}_t$,

(20)

From (18) and (20) it follows that the first order approximation of the distance (for the $\ell_2$ metric) between an input $\mathbf{x}$, with class $c(\mathbf{x})$, and a perturbed input $\hat{\mathbf{x}}$ classified to the boundary hyper-surface separating the classes $c(\mathbf{x})$ and $k$ is given by (17). ∎
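The displays (17)-(20) are missing from this copy. The central fact used in the sketch is presumably the standard $\ell_2$ point-to-hyper-plane distance, which for a tangent hyper-plane $\mathcal{T} = \{\mathbf{z} : \mathbf{w}^{\top}\mathbf{z} + b = 0\}$ reads

$$d(\mathbf{x}, \mathcal{T}) = \frac{\big|\mathbf{w}^{\top}\mathbf{x} + b\big|}{\|\mathbf{w}\|_2},$$

with $\mathbf{w}$ and $b$ obtained from the first-order expansion of $z^{(L)}_k - z^{(L)}_{c(\mathbf{x})}$ around the point of tangency; substituting this expansion is what yields the expression in (17). This is our reconstruction under those assumptions, not a verbatim quote of the paper.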

Appendix 0.B Proof of Proposition 3

We reiterate Corollary 2 and Proposition 3 before proving the latter.

Corollary 2 Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$. Then the $\ell_2$ norm of the minimal perturbation necessary to fool the classification function is approximated by

(21)

Proposition 3 makes a direct connection to the Jacobian of the network.

Proposition 3 Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$. Then the first order approximation of the $\ell_2$ norm of the minimal perturbation necessary to fool the classification function is lower bounded by

(22)

Proof. Relying on Corollary 2 we get that

(23)
(24)
(25)

Since we get that

(26)

and accordingly,

(27)

as stipulated. ∎

Appendix 0.C Jacobian regularization of the network's (L-1)-th layer - Mathematical Analysis

To provide a bound for the $(L-1)$-th layer of the network, we rely on the work in [36], which shows that fixing the weight matrix of the last fully connected layer in a deep neural network causes little to no loss of accuracy while allowing memory and computational benefits. Assuming that this layer corresponds to a weight matrix $W$ with orthonormal columns, it is possible to take another path in the proof of Proposition 3 and obtain a bound as a function of the Jacobian of the $(L-1)$-th layer of the network. This bound is exactly as (22), but with $\|J^{(L-1)}(\mathbf{x})\|_F$ instead of $\|J(\mathbf{x})\|_F$, as formulated in Proposition 4 hereafter. Note that a trivial application of the multiplicative matrix norm inequality on (22) leads to a bound with an additional factor of $\sqrt{K}$ in the denominator, since

(28)
(29)

where the last equality stems from the fact that $\|W\|_F = \sqrt{K}$ for a matrix $W$ with $K$ orthonormal columns.

Given that we have two similar bounds, one depending on the Jacobian of the whole network and one on the Jacobian of the $(L-1)$-th layer, it is important to ask which of them we should regularize to render better robustness to the network. The experimental results for the regularization of the Jacobian of the $(L-1)$-th layer of the network, with and without a fixed $W$ with orthonormal columns, are given in Appendix 0.D. In this section and the following we use the following additional notation:

  • The network's $(L-1)$-th layer is a fully connected layer consisting of $M$ neurons. We use the index $m$ for a specific neuron in this layer.

  • $J^{(L-1)}(\mathbf{x})$ is the Jacobian matrix of layer $L-1$ of the network with respect to its input $\mathbf{x}$, and $J^{(L-1)}_m(\mathbf{x})$ is the $m$-th row in this matrix.

  • We assume that no activation or other non-linear function is applied before the softmax. Thus, the relationship between the $(L-1)$-th layer and the last layer of the network is as follows: $\mathbf{z}^{(L)} = W^{\top}\mathbf{z}^{(L-1)}$, $z^{(L)}_k = \mathbf{w}_k^{\top}\mathbf{z}^{(L-1)}$, and $J(\mathbf{x}) = W^{\top}J^{(L-1)}(\mathbf{x})$, where $\mathbf{w}_k$ is the $k$-th column in the matrix $W \in \mathbb{R}^{M \times K}$.

We introduce the following proposition to lay the theoretical foundation for the regularization of the Jacobian matrix of the $(L-1)$-th layer of the network.

Proposition 4

Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$ and let $W$, the weight matrix of the last fully connected layer in the network, have orthonormal columns. Then, the first order approximation of the $\ell_2$-norm of the minimal perturbation necessary to fool the classification function is lower bounded as

(30)

Proof. Since the weight matrix $W$ has orthonormal columns, for any $\mathbf{u} \in \mathbb{R}^{K}$ it holds that $\|W\mathbf{u}\|_2 = \|\mathbf{u}\|_2$, and for any $\mathbf{v} \in \mathbb{R}^{M}$ it holds that $\|W^{\top}\mathbf{v}\|_2 \le \|\mathbf{v}\|_2$.

We remind the reader that according to Lemma 1, the first order approximation of the distance between an input $\mathbf{x}$, with class $c(\mathbf{x})$, and a perturbed input $\hat{\mathbf{x}}$ classified to the boundary hyper-surface separating the classes $c(\mathbf{x})$ and $k$ is given by

(31)

Developing this equality further using the chain rule, we have

(32)
(33)
(34)

where the last inequality stems from the multiplicative matrix norm inequality. Since the columns of $W$ are orthonormal, it holds that

(35)

Plugging (35) in (34) leads to

(36)

Let $c(\mathbf{x})$ be the correct class for the input sample $\mathbf{x}$. Then, the first order approximation of the $\ell_2$-norm of the minimal perturbation necessary to fool the network is exactly as lower bounded in (30). ∎

Appendix 0.D Jacobian regularization of the network's (L-1)-th and (L-2)-th layers - Experiments

In this section we show the empirical results of a regularization based on the Frobenius norm of the Jacobian matrix of the $(L-1)$-th and $(L-2)$-th layers of the network ($J^{(L-1)}$ and $J^{(L-2)}$ respectively).

Since the $(L-1)$-th layer typically consists of substantially more neurons than the last layer, i.e. $M \gg K$, the evaluation of the Jacobian matrix of the $(L-1)$-th layer is much more computationally demanding; in our network for MNIST classification, for example, this layer is far wider than the $K = 10$ outputs. Accordingly, when regularizing the Jacobian matrix of the $(L-1)$-th layer we reduced the size of the training mini-batch to 50 inputs per mini-batch due to computational constraints.

This increase in the computational overhead required for the regularization of the Jacobian of the network's $(L-1)$-th layer is a significant factor in our decision to prefer regularizing the Jacobian of the network's last layer. Our choice of the last layer is further supported by the fact that it also leads to superior results under most attack methods compared to the regularization of the $(L-1)$-th layer, as we show hereafter.

We examine two cases: one in which the weight matrix $W$ is fixed with orthonormal columns (i.e. not updated during training), and one in which no restriction is imposed on $W$ and it is updated during training. For comparison, Table 5 (same as Table 3) presents the results achieved for MNIST under the DeepFool attack with the regularization of the Frobenius norm of the Jacobian of the last layer of the network.

Defense method Test accuracy
No defense x
Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 5: Robustness to DeepFool attack for MNIST, regularization based on the Jacobian of the last layer
Defense method Test accuracy
No defense x
Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 6: Robustness to DeepFool attack for MNIST – regularization of $J^{(L-1)}$, with a fixed $W$ with orthonormal columns
Defense method Test accuracy
No defense x
Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 7: Robustness to DeepFool attack for MNIST – regularization of $J^{(L-1)}$, with $W$ updated during training

Tables 6 and 7 present the results for MNIST under the DeepFool attack with a regularization based on the Frobenius norm of the Jacobian of the $(L-1)$-th layer of the network, $J^{(L-1)}$. Table 6 considers the case where the last layer of the network has a fixed weight matrix $W$ with orthonormal columns, and Table 7 demonstrates the scenario where the weight matrix $W$ is updated during training.

Note that regularizing the Jacobian of the last layer of the network is significantly better than regularizing the $(L-1)$-th layer. In addition, it is interesting to remark that when $W$ is updated during training, the test accuracy on the original dataset is higher compared to the case in which $W$ is fixed. Yet, the robustness results under the DeepFool attack are similar in both cases.

(a) FGSM attack
(b) JSMA attack
Figure 3: MNIST test accuracy under FGSM (left) and JSMA (right) attacks for different values of $\epsilon$, for Jacobian regularization of the last layer; Jacobian regularization of layer $L-1$ with $W$ fixed with orthonormal columns; and Jacobian regularization of layer $L-1$ with $W$ updated during training

Comparisons of the test accuracies under the FGSM and JSMA attack methods are presented in Fig. 3. The JSMA attack was performed as a 1 pixel attack with 80 epochs.

Note that under the JSMA attack the regularization of the network's $(L-1)$-th layer yields better results, whereas under the FGSM attack the regularization of the last layer of the network yields better results. Keeping the weight matrix $W$ constant with orthonormal columns generally harms the robustness results in the former case and improves them in the latter case.

For completeness, we also examined the case where Jacobian regularization is applied to the $(L-2)$-th layer of the network, i.e. regularizing the Frobenius norm of $J^{(L-2)}$, both with the weight matrices of the last two layers updated during training and with them fixed with orthonormal columns. This case is even more computationally demanding; e.g., in the case of MNIST our network's $(L-2)$-th layer consists of 3136 neurons. The empirical results under the DeepFool attack are given in Table 8 and Table 9.

Defense method Test accuracy
No defense x
Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 8: Robustness to DeepFool attack for MNIST – regularization of $J^{(L-2)}$, with the weight matrices of the last two layers fixed with orthonormal columns
Defense method Test accuracy
No defense x
Adversarial Training x
Jacobian regularization x
Jacobian regularization + Adversarial Training x
Table 9: Robustness to DeepFool attack for MNIST – regularization of $J^{(L-2)}$, with the weight matrices of the last two layers updated during training

As can be seen in the results, the obtained robustness is significantly lower when Jacobian regularization is applied to the $(L-2)$-th layer of the network, compared to the $(L-1)$-th layer and the last layer. The results show that Jacobian regularization becomes less effective when it is applied to earlier layers of the network. In the case where and