Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

02/22/2018 · Zhewei Yao, et al. · The University of Texas at Austin · UC Berkeley

Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The precise underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze how the landscape of the loss functional is different for large batch size training. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Our results on multiple networks show that, when training at large batch sizes, one tends to stop at points in the parameter space with a noticeably larger Hessian spectrum, i.e., points where the eigenvalues of the Hessian are much larger. We then study how batch size affects robustness of the model in the face of adversarial attacks. All the results show that models trained with large batches are more susceptible to adversarial attacks, as compared to models trained with small batch sizes. Furthermore, we prove a theoretical result which shows that the problem of finding an adversarial perturbation is a saddle-free optimization problem. Finally, we show empirical results that demonstrate that adversarial training leads to areas with smaller Hessian spectrum. We present detailed experiments with five different network architectures tested on MNIST, CIFAR-10, and CIFAR-100 datasets.


1 Introduction

During the training of a Neural Network (NN), we are given a set of input data with corresponding labels drawn from an unknown distribution. In practice, we only observe a set of discrete examples drawn from this distribution, and train the NN to learn it. This is typically a non-convex optimization problem, in which the choice of hyper-parameters strongly affects the convergence properties. In particular, it has been observed that training with a large batch size often results in convergence to points that generalize poorly. The main motivation for using large batches is the increased opportunity for data parallelism, which can be used to reduce training time [17, 14]. Recently, several works have proposed different methods to avoid the performance loss with large batches [17, 31, 33]. However, these methods do not work for all networks and datasets. This has motivated us to revisit the original problem and study how optimization with a large batch size affects the convergence behavior.

Figure 1: The top 20 eigenvalues of the Hessian are shown for C1 on CIFAR-10 (left) and M1 on MNIST (right). The spectrum is computed using power iteration to a fixed relative error.

We start by analyzing how the Hessian spectrum and the gradient change during training with small batch sizes, compare this to large batch sizes, and then draw a connection to robust training.

In particular, we aim to answer the following questions:

Q1: How is training with a large batch size different from training with a small batch size? Equivalently, how does the local geometry of the neighborhood to which the model converges differ when a large batch size is used instead of a small one?

A1: We back-propagate the second derivative and compute its spectrum during training. The results show that, despite the arguments that a prevalence of saddle points plagues optimization [7, 13], this is actually not the problem with large batch size training, even when the batch size is increased to the gradient descent limit. In [20], an approximate numerical method was used to estimate the maximal eigenvalue at a point. Here, by directly computing the spectrum of the true Hessian, we show that large batch size training progressively gets trapped in areas with a noticeably larger spectrum (and not just a larger dominant eigenvalue). For details please see §2, especially Figs. 1, 2 and 4.

Q2: What is the connection between robust optimization and large batch size training? Equivalently, how does the batch size affect the robustness of the model to adversarial perturbations?

A2: We show that robust optimization is antithetical to large batch training, in the sense that it favors areas with a small spectrum (i.e., flat minima). We show that points reached with large batch size training are significantly more prone to adversarial attacks than a model trained with a small batch size. Furthermore, we show that robust training progressively favors the opposite, leading to points with a flat spectrum that are robust to adversarial perturbations. We provide empirical and theoretical evidence that the inner loop of the robust optimization, where we find the worst-case perturbation, is a saddle-free optimization problem almost everywhere. Details are discussed in §3, especially Tables 1 and 7 and Figs. 4 and 6.

Limitations: We believe it is critical for every paper to clearly state its limitations. In this work, we have made an effort to avoid reporting only the best results: we repeated all the experiments at least three times and found the findings to be consistent. Furthermore, we performed the tests on multiple datasets and multiple models, including a residual network, to avoid results that may be specific to a particular test. The main limitation is that we do not propose a solution for large batch training. We offer analytical insights into the relationship between large batch and robust training, but we do not fully resolve the problem of large batch training. Several approaches to increasing the batch size have been proposed so far [17, 31, 33], but they only work for particular cases and require extensive hyper-parameter tuning. We are performing an in-depth follow-up study to use the results of this paper to better guide large batch size training.

Related Work. Deep neural networks have achieved good performance on a wide range of applications. The diversity of problems that a DNN can be used for has been related to their efficiency in function approximation [27, 8, 22, 2]. However, the work of [34] showed that a network that performs well on a real dataset can also memorize randomly labeled data very well. Moreover, the performance of the network is highly dependent on the hyper-parameters used for training. In particular, recent studies have shown that Neural Networks can easily be fooled by imperceptible perturbations to the input data [16]. Moreover, multiple studies have found that large batch size training suffers from poor generalization [17, 33].

Here we focus on the latter two aspects of training neural networks. [20] presented results showing that large batches converge to "sharper minima". It was argued that even if a sharp minimum has the same training loss as a flat one, small discrepancies between the test data and the training data can easily lead to poor generalization [20, 10]. The observation that "flat minima" generalize well goes back to the earlier work of [19], which related flat minima to the theory of minimum description length [28] and proposed an optimization method that favors flat minima. There have been several similar attempts to change the optimization algorithm to find "better" regions [9, 6]. For instance, [6] proposed Entropy-SGD, which uses Langevin dynamics to augment the loss functional to favor flat regions of the "energy landscape". The notion of flatness/sharpness does not have a precise definition. A detailed comparison of different metrics is given in [10], where the authors show that sharp minima can also generalize well, and argue that the sharpness can be changed arbitrarily by reparametrizing the weights. However, this does not happen when we consider the same model and only change the training hyper-parameters, which is the case here. In [31, 30], the authors propose that training can be viewed as a stochastic differential equation, and argue that the optimal batch size is proportional to the training set size and the learning rate.

Figure 2: The landscape of the loss is shown along the dominant eigenvector of the Hessian for C1 on the CIFAR-10 dataset. The x-axis is a scalar that perturbs the model parameters along this eigenvector.

As our results show, there is an interleaved connection that appears when studying the cases in which NNs do not work well. The works of [32, 16] found that a NN with very good generalization can easily be fooled by slightly perturbing its inputs. The perturbation magnitude is usually imperceptible to the human eye, but can completely change the network's prediction. They introduced an effective adversarial attack algorithm known as the Fast Gradient Sign Method (FGSM), related the vulnerability of neural networks to linear classifiers, and showed that RBF models, despite achieving much worse generalization performance, are considerably more robust to FGSM attacks. The FGSM method was then extended in [21] to an iterative FGSM, which performs multiple gradient ascent steps to compute the adversarial perturbation. An adversarial attack based on iterative FGSM was found to be stronger than the original one-step FGSM. Various defenses have been proposed to resist adversarial attacks [26, 15, 18, 3, 12]. We will later show that there is an interleaved connection between the robustness of the model and the large batch size problem.

The structure of this paper is as follows. In §2, we present the results by first analyzing how the spectrum changes during training, and we test the generalization performance of the model for different batch sizes. In §3, we discuss the details of how the adversarial attack/training is performed. In particular, we provide a theoretical proof that finding an adversarial perturbation is a saddle-free problem under certain conditions, and we test the robustness of the model for different batch sizes. We also present empirical results showing how robust training affects the spectrum. Finally, in §4 we provide concluding remarks.

2 Large Batch, Generalization Gap and Hessian Spectrum

Setup: The architectures of the networks used are reported in Table 6. In the text, we refer to each architecture by the abbreviation used in this table. Unless otherwise specified, each batch size is trained until a training loss of 0.001 or better is achieved. All batch sizes are trained under the same conditions, and no weight decay or dropout is used.

We first compare large batch size training with small batch size training. The results for the C1 network on the CIFAR-10 dataset and the M1 network on MNIST are shown in Table 1 and Table 7, respectively. As one can see, after a certain point, increasing the batch size results in performance degradation on the test dataset. This is in line with results in the literature [20, 17].

As discussed before, one popular explanation of the poor generalization accuracy of large batch sizes has been that large batches tend to get attracted to "sharp" minima of the training loss. In [20], an approximate metric was used to measure the curvature of the loss function at a given model parameter. Here, we directly compute the Hessian spectrum, using power iteration by back-propagating the matrix-vector product (matvec) of the Hessian [25]. Unless otherwise noted, we continue the power iterations until a fixed relative error is reached for each eigenvalue. The Hessian spectrum for different batch sizes is shown in Fig. 1. Moreover, the value of the dominant eigenvalue is reported in Table 1 and Table 2, respectively (an additional result for MNIST using LeNet-5 is given in the appendix; please see Table 7). From Fig. 1, we can clearly see that, for all the experiments, large batches have a noticeably larger Hessian spectrum, both in the dominant eigenvalue and in the remaining 19 eigenvalues. However, note that curvature is a very local measure. It is more informative to study how the loss functional behaves in a neighborhood around the point to which the model has converged. To demonstrate this visually, we plot how the total loss changes when the model parameters are perturbed along the dominant eigenvector, as shown in Fig. 2 and Fig. 7 for the C1 and M1 models, respectively. We can clearly see that the large batch size models have been attracted to areas with higher curvature, for both the test and training losses.
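
To make the computation above concrete, the following is a minimal PyTorch-style sketch of power iteration based on Hessian-vector products obtained by back-propagating through the gradient (double back-propagation). The function names, the iteration cap, and the stopping tolerance are illustrative assumptions, not the authors' implementation.

```python
import torch

def hessian_vector_product(grads, params, vec):
    """Compute H v by differentiating the inner product <grad, v> w.r.t. params."""
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params, retain_graph=True)

def dominant_eigenpair(loss, params, max_iters=100, rel_tol=1e-3):
    """Estimate the top eigenvalue/eigenvector of the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Start from a random direction and normalize it.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = None
    for _ in range(max_iters):
        hv = hessian_vector_product(grads, params, v)
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
        if eig is not None and abs(new_eig - eig) <= rel_tol * abs(eig):
            break
        eig = new_eig
    return eig, v
```

The remaining eigenvalues of the top-20 spectrum shown in Fig. 1 can be obtained by deflation, i.e., by projecting previously found eigenvectors out of the iterate at every step.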

Figure 3: The landscape of the loss is shown when the C1 model parameters are perturbed along the first two dominant eigenvectors of the Hessian, as a function of the perturbation magnitude along each eigenvector.

This is reflected in the figures. We have also added a 3D plot, in which we perturb the parameters of the C1 model along both the first and second eigenvectors, as shown in Fig. 3. The visual results are in line with the values of the Hessian spectrum shown in Table 1 and Table 7. For instance, note the values of the dominant eigenvalue for the training and test loss in Table 1 and compare them with the corresponding results in Fig. 3.
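
For reference, a perturbation plot like the ones in Figs. 2 and 3 can be produced by evaluating the total loss at parameters shifted along the dominant eigenvector(s). The sketch below is a hedged illustration; `model`, `loss_fn`, `loader`, the eigenvector list `v1` (e.g., from the power-iteration sketch above), and the range of perturbation magnitudes are all assumed inputs.

```python
import torch

@torch.no_grad()
def loss_along_direction(model, loss_fn, loader, v1, alphas):
    """Evaluate the average loss at theta + alpha * v1 for each alpha."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    for alpha in alphas:
        # Perturb the parameters along the eigenvector direction.
        for p, p0, d in zip(model.parameters(), base, v1):
            p.copy_(p0 + alpha * d)
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(model(x), y).item() * x.size(0)
            count += x.size(0)
        losses.append(total / count)
    # Restore the original parameters.
    for p, p0 in zip(model.parameters(), base):
        p.copy_(p0)
    return losses
```

A 2-D surface like Fig. 3 follows by nesting two such loops over the first and second eigenvectors.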

A recent argument has been that saddle points in high dimensions plague optimization for neural networks [7, 13]. We have computed the dominant eigenvalue of the Hessian along with the total gradient during training and report them in Fig. 4. As we can see, large batch size training progressively gets attracted to areas with a larger spectrum, but it clearly does not get stuck in saddle points [23].

Batch Acc. Acc Acc
C1 (CIFAR-10)
16 100  (77.68) 0.64  (32.78) 2.69  (200.7) 0.05  (20.41) 48.07  (30.38) 72.67  (42.70)
32 100  (76.77) 0.97  (45.28) 3.43  (234.5) 0.05  (23.55) 49.04  (31.23) 72.63  (43.30)
64 100  (77.32) 0.77  (48.06) 3.14  (195.0) 0.04  (21.47) 50.40  (32.59) 73.85  (44.76)
128 100  (78.84) 1.33  (137.5) 1.41  (128.1) 0.02  (13.98) 33.15  (25.2 ) 57.69  (39.09)
256 100  (78.54) 3.34  (338.3) 1.51  (132.4) 0.02  (14.08) 25.33  (19.99) 50.10  (34.94)
512 100  (79.25) 16.88  (885.6) 1.97  (100.0) 0.04  (10.42) 14.17  (12.94) 28.54  (25.08)
1024 100  (78.50) 51.67  (2372 ) 3.11  (146.9) 0.05  (13.33) 8.80  (8.40 ) 23.99  (21.57)
2048 100  (77.31) 80.18  (3769 ) 5.18  (240.2) 0.06  (18.08) 4.14  (3.77 ) 17.42  (16.31)
C2 (CIFAR-10)
256 100  (79.20) 0.62  (28 ) 12.10  (704.0) 0.10  (41.95) 0.57  (0.38) 0.73  (0.47)
512 100  (80.44) 0.75  (57 ) 4.82  (425.2) 0.03  (26.14) 0.34  (0.25) 0.54  (0.38)
1024 100  (79.61) 2.36  (142) 0.523  (229.9) 0.04  (17.16) 0.27  (0.22) 0.46  (0.35)
2048 100  (78.99) 4.30  (307) 0.145  (260.0) 0.50  (17.94) 0.18  (0.16) 0.33  (0.28)
Table 1: Results on the CIFAR-10 dataset using the C1 and C2 networks. We show the Hessian spectrum of models trained with different batch sizes, and the corresponding performance on adversarial datasets generated from the training/testing dataset (testing results are given in parentheses).

3 Large Batch, Adversarial Attack and Robust Training

We first give a brief overview of adversarial attack and robust training and then present results connecting these with large batch size training.

3.1 Robust Optimization and Adversarial Attack

Here we focus on white-box adversarial attacks, and in particular on optimization-based approaches for both the attack and the defense. Suppose we are given a learning model (the neural network architecture) with parameters $\theta$, and input data $x$ with corresponding labels $y$. The loss functional of the network on $(x, y)$ is denoted by $\mathcal{L}(\theta, x, y)$. For an adversarial attack, we seek a perturbation $\Delta x$ (with a bounded $\ell_\infty$ or $\ell_2$ norm) that maximizes $\mathcal{L}$:

$$\max_{\Delta x \in \mathcal{U}} \; \mathcal{L}(\theta, x + \Delta x, y), \qquad (1)$$

where $\mathcal{U}$ is an admissibility set of acceptable perturbations (typically restricting the magnitude of the perturbation). A typical choice for this set is a ball of radius $\epsilon$ around the input. A popular method for approximately computing $\Delta x$ is the Fast Gradient Sign Method (FGSM) [16], where the gradient of the loss functional is computed w.r.t. the input, and the perturbation is set to:

$$\Delta x = \epsilon \, \mathrm{sign}\!\left(\nabla_x \mathcal{L}(\theta, x, y)\right). \qquad (2)$$

This is not the only possible attack method. Other approaches include an iterative FGSM (FGSM-10) [21], or using the $\ell_2$ norm instead of $\ell_\infty$ (this variant is listed in Table 5). Here we also use a second-order attack, in which we use the Hessian w.r.t. the input to precondition the gradient direction with second-order information; please see Table 5 in the Appendix for details.
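
A minimal sketch of the FGSM perturbation of Eq. 2 and its iterative variant in PyTorch is given below; the clamping to the [0, 1] image range and the step schedule of the iterative version are illustrative assumptions rather than the exact settings used in the paper.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x L)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

def iterative_fgsm(model, loss_fn, x, y, eps, steps=10):
    """FGSM-10: repeated small FGSM steps, projected back into the eps-ball around x."""
    x = x.detach()
    x_adv = x.clone()
    step = eps / steps
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step * grad.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back to the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```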

One method to defend against such adversarial attacks is to perform robust training [32, 24]:

$$\min_{\theta} \max_{\Delta x \in \mathcal{U}} \; \mathcal{L}(\theta, x + \Delta x, y). \qquad (3)$$

Solving this min-max optimization problem requires, at each iteration, first finding the worst adversarial perturbation that maximizes the loss, and then updating the model parameters on those perturbed examples. Since adversarial examples have to be generated at every iteration, it is not feasible to find the exact perturbation that maximizes the objective function. Instead, a popular approach is to perform one or more gradient ascent steps to approximately compute $\Delta x$. After computing $\Delta x$ at each iteration, a typical optimization step (a variant of SGD) is performed to update $\theta$.
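
A hedged sketch of the resulting training loop follows, using a single FGSM-style gradient-ascent step for the inner maximization and an SGD step on the perturbed batch for the outer minimization. The single-step inner solve, the cross-entropy loss, and the optimizer settings are illustrative choices, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

def robust_training_epoch(model, loader, optimizer, eps):
    """One epoch of approximate min-max training (Eq. 3)."""
    model.train()
    for x, y in loader:
        # Inner maximization: one gradient-ascent (FGSM) step on the input.
        x_pert = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_pert), y)
        grad, = torch.autograd.grad(loss, x_pert)
        x_adv = (x_pert + eps * grad.sign()).detach()

        # Outer minimization: a standard SGD step on the adversarially perturbed batch.
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```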

Next we show that solving the maximization part is actually a saddle-free problem almost everywhere. This property assures us that the Hessian w.r.t. the input has no negative eigenvalues, which allows us to use Conjugate Gradient (CG) as the Newton solver for our second-order adversarial perturbation tests in §3.4. (This result might also be helpful for finding better optimization strategies for GANs.)

3.2 Adversarial perturbation: A saddle-free problem

Recall that our loss functional is $\mathcal{L}(\theta, x, y)$. We make the following assumption about the model in order to establish our theoretical result.

Assumption 1.

We assume that all of the model's activation functions are ReLU, and that all layers are either convolutional or fully connected; Batch Normalization layers are also allowed. Note that even though the derivative of the ReLU activation is discontinuous at the origin, the ReLU function is twice differentiable almost everywhere.

The following theorem shows that the problem of finding an adversarial perturbation that maximizes $\mathcal{L}$ is a saddle-free optimization problem, with a positive semi-definite (PSD) Hessian w.r.t. the input almost everywhere. For details of the proof, please see Appendix A.1.

Theorem 1.

Under Assumption 1, the loss functional of a DNN is a saddle-free function w.r.t. the input almost everywhere, i.e., $\nabla_x^2 \mathcal{L}(\theta, x, y) \succeq 0$ almost everywhere.

From the proof of Theorem 1, we immediately obtain the following proposition for DNNs:

Proposition 2.

Under Assumption 1, if the number of output classes is $C$, then the Hessian of the DNN loss w.r.t. the input $x$ is at most a rank-$C$ matrix almost everywhere; see Appendix A.1 for details.

Batch Acc. Acc Acc
64 99.98  (70.81) 0.022  (10.43) 61.54  (34.48) 78.57  (39.94)
128 99.97  (70.9 ) 0.055  (26.50 ) 58.15  (33.73) 77.41  (38.77)
256 99.98  (68.6 ) 1.090  (148.29) 39.96  (28.37) 66.12  (35.02)
512 99.98  (68.6 ) 1.090  (148.29) 40.48  (28.37) 66.09  (35.02)
Table 2: Results on the CIFAR-100 dataset using the CR network. We show the Hessian spectrum of models trained with different batch sizes, and the corresponding performance on adversarial datasets generated from the training/testing dataset (testing results are given in parentheses).

3.3 Large Batch Training and Robustness

Here, we test the robustness of models trained with different batch sizes against an adversarial attack. We use the Fast Gradient Sign Method for all the experiments (we did not see any difference with the FGSM-10 attack). The adversarial performance is measured by the fraction of correctly classified adversarial inputs. We report the performance for both the training and test datasets for different values of the perturbation magnitude $\epsilon$ (measured in the $\ell_\infty$ norm). The performance results for the C1 and C2 models on CIFAR-10 and the CR model on CIFAR-100 are reported in the last two columns of Tables 1 and 2 (MNIST results are given in the appendix, Table 7). The interesting observation is that, for all the cases, large batches are considerably more prone to adversarial attacks than small batches. This means that not only the model design affects the robustness of the model, but also the hyper-parameters used during optimization, and in particular the properties of the point to which the model has converged.
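
The adversarial accuracies reported in the tables can be computed along the following lines. This is a self-contained sketch, assuming images in the [0, 1] range and a cross-entropy loss; it simply regenerates FGSM examples for a given model and counts how many are still classified correctly.

```python
import torch
import torch.nn.functional as F

def adversarial_accuracy(model, loader, eps):
    """Percentage of FGSM-perturbed inputs that are still classified correctly."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total
```

Running this routine on each of the models trained with different batch sizes, for both the training and test loaders and for each perturbation magnitude, yields numbers analogous to the last two columns of Tables 1 and 2.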

MEAN of Adv
99.32 60.37 77.27 14.32 82.04 33.21 53.44
99.49 96.18 97.44 63.46 97.56 83.33 87.59
99.5 96.52 97.63 66.15 97.66 84.64 88.52
98.91 96.88 97.39 86.23 97.66 92.56 94.14
99.45 94.41 96.48 52.67 96.89 77.58 83.60
98.72 95.02 96.49 77.18 97.43 90.33 91.29
Table 3: Accuracy of different models on different adversarial test sets of MNIST, each generated by attacking the original model. The first row corresponds to the original model; the remaining rows correspond to models adversarially trained with the different attack methods of §3.1 (the second row uses FGSM).
Figure 4: Changes in the dominant eigenvalue of the Hessian w.r.t. the weights and in the total gradient are shown over epochs during training. Note the increase in the dominant eigenvalue (blue curve) for large batch vs. small batch. In particular, note that the values of the total gradient along with the Hessian spectrum show that large batch training does not get "stuck" in saddle points, but rather in areas of the optimization landscape with high curvature. More results are shown in Fig. 16. The dotted points show the corresponding results when using robust optimization, which makes the solver stay in areas with a smaller spectrum.

From these results, there seems to be a strong correlation between the spectrum of the Hessian w.r.t. the weights and how robust the model is. However, we want to emphasize that, in general, there is no correlation between the Hessian w.r.t. the weights and the robustness of the model w.r.t. the input. For instance, consider a function of two variables $f(\theta, x)$ (treating $\theta$ and $x$ each as a single variable): the Hessian spectrum of $f$ w.r.t. $\theta$ need not have any relation to the robustness of $f$ w.r.t. $x$. This can be demonstrated with a simple least-squares problem, $f(\theta, x) = \frac{1}{2}\|x\theta - y\|^2$, whose Hessians w.r.t. $\theta$ and w.r.t. $x$ are $x^T x$ and $\theta\theta^T$, respectively. Therefore, in general we cannot link the Hessian spectrum w.r.t. the weights to the robustness of the network. However, the numerical results for all the neural networks show that models with a higher Hessian spectrum w.r.t. the weights are also more prone to adversarial attacks. A potential explanation would be to look at how the gradient and Hessian w.r.t. the input change for different batch sizes. We have computed the dominant eigenvalue of this input Hessian using power iteration for each individual input sample, for both the training and testing datasets. Furthermore, we have computed the norm of the gradient w.r.t. the input for these datasets as well. These two metrics are reported in Table 1. The results of all our experiments show that these two metrics actually do not correlate with the adversarial accuracy. For instance, some of the large batch C1 models have both a smaller input gradient and a smaller input Hessian eigenvalue than the small batch models, yet they perform markedly worse under adversarial attack. One possible reason could be that the decision boundaries for large batches are less stable, so that a small adversarial perturbation fools the model.
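
For completeness, the per-sample input-gradient norm mentioned above can be computed as in the sketch below; the dominant eigenvalue of the Hessian w.r.t. the input can be obtained with the same power-iteration routine shown earlier, applied to the input tensor instead of the weights. The reduction choice and the evaluation-mode assumption (so that Batch Normalization does not couple samples) are illustrative.

```python
import torch
import torch.nn.functional as F

def input_gradient_norms(model, loader):
    """Per-sample L2 norm of the loss gradient w.r.t. the input."""
    model.eval()  # keep samples independent (e.g., fixed BatchNorm statistics)
    norms = []
    for x, y in loader:
        x = x.clone().detach().requires_grad_(True)
        # Summing per-sample losses makes grad[i] the gradient of sample i's own loss.
        loss = F.cross_entropy(model(x), y, reduction="sum")
        grad, = torch.autograd.grad(loss, x)
        norms.extend(grad.flatten(start_dim=1).norm(dim=1).tolist())
    return norms
```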

MEAN of Adv
79.46 15.25 4.46 12.37 29.64 22.93 16.93
71.82 63.05 63.44 57.68 66.04 62.36 62.51
71.14 63.32 63.88 58.25 65.95 62.70 62.82
63.52 59.33 59.73 57.35 60.44 58.98 59.16
74.34 47.65 43.95 38.45 62.75 55.77 49.71
71.59 50.05 46.66 42.95 62.87 58.42 52.19
Table 4: Accuracy of different models on different adversarial test sets of CIFAR-10, each generated by attacking the original model. Rows are organized as in Table 3.
Figure 5: 1-D parametric plot for the C3 model on CIFAR-10. We interpolate between the parameters of the original model and those of the robust model, and plot the cross-entropy loss on the y-axis.
Figure 6: Spectrum of the sub-sampled Hessian of the loss functional w.r.t. the weights for the C1 model. The results are computed for randomly chosen subsets of the training data of different sizes.

3.4 Adversarial Training and Hessian Spectrum

In this part, we study how the Hessian spectrum and the landscape of the loss functional change after adversarial training is performed. Here, we fix the batch size (and all other optimization hyper-parameters) and use five different adversarial training methods as described in §3.1.

For the sake of clarity, we distinguish between test datasets: the original clean test dataset, and adversarial test datasets created by a given attack method (for instance, the adversarial dataset generated by FGSM).

Setup: For the MNIST experiments, we train a standard LeNet [4] on the MNIST dataset (using the M1 network). For the original training, we set the learning rate to 0.01 and the momentum to 0.9, and decay the learning rate by half every 5 epochs, for a total of 100 epochs. We then perform an additional five epochs of adversarial training with a reduced learning rate. The perturbation magnitude is set separately for each type of attack. We also present results for the C3 model [5] on CIFAR-10, using the same hyper-parameters, except that the training is performed for 100 epochs. Afterwards, adversarial training is performed for a number of subsequent epochs with a reduced learning rate and momentum (the learning rate is decayed by half after five epochs). The adversarial perturbation magnitude is again set separately for each type of attack [29].

The results are shown in Tables 3 and 4. We can see that after adversarial training the model becomes more robust to these attacks. Note that the accuracy under different adversarial attacks varies, which is expected given the different strengths of the attack methods. In addition, all adversarial training methods improve robustness on the adversarial datasets, though they lose some accuracy on the clean test set, which is consistent with the observations in [16]. As an example, consider the second row of Table 3, which shows the results when FGSM is used for robust training. The performance of this model when tested against one of the attack methods is 63.46%, as opposed to 14.32% for the original model. The remaining rows show the results for the other algorithms. In §A.4, we further discuss the performance of the first-order and second-order attacks.

The main question here is how the landscape of the loss functional changes after these robust optimizations are performed. We first show a 1-D parametric interpolation between the original model parameters and those of the robustified models, as shown in Figs. 5 and 10 (see Fig. 11 for all cases). Notice that the robust models end up at points with smaller curvature than the original model. To quantify this exactly, we compute the spectrum of the Hessian, as shown in Figs. 6 and 12. Besides the full Hessian spectrum, we also report the spectrum of the sub-sampled Hessian, computed by randomly selecting a subset of the training dataset. We denote the size of this subset separately to avoid confusion with the training batch size, and report results for two subset sizes. There are several important observations here. First, notice that the spectrum of the robust models is noticeably smaller than that of the original model. This means that the min-max problem of Eq. 3 favors areas with lower curvature. Second, note that even though the total Hessian shows that we have converged to a point with positive curvature (at least based on the top 20 eigenvalues), that is not necessarily the case when we look at individual samples. For a randomly selected small batch, we see that we have actually converged to a point that has both positive and negative curvature, with a non-zero gradient (meaning it is not a saddle point). To the best of our knowledge this is a new finding, but one that is expected, as SGD optimizes the expected loss rather than the individual per-sample losses.

Now going back to Fig. 4, we show how the spectrum changes during training when we use robust optimization. We can clearly see that with robust optimization the solver is pushed to areas with a smaller spectrum. This is a very interesting finding and shows the possibility of using robust training as a systematic means to bias the solver away from sharp minima. A preliminary result is shown in Table 8, where we can see that robust optimization performs better for large batch size training than the baseline or the method proposed by [17]. However, we emphasize that our goal is to perform analysis to better understand the problems with large batch size training. More extensive tests are needed before one could claim that robust optimization performs better than other methods.

4 Conclusion

In this work, we studied NNs through the lens of the Hessian operator. In particular, we studied large batch size training and its connection with the stability of the model in the presence of white-box adversarial attacks. By computing the Hessian spectrum we provided several points of evidence showing that large batch size training tends to get attracted to areas with a higher Hessian spectrum. We reported the eigenvalues of the Hessian computed over the whole dataset, and plotted the landscape of the loss when perturbed along the dominant eigenvector. The visual results were in line with the numerical values of the spectrum. Our empirical results show that adversarial attacks/training and large batches are closely related. We provided several empirical results on multiple datasets showing that large batch size training is more prone to adversarial attacks (more results are provided in the supplementary material). This means that not only is the model design important, but also that the optimization hyper-parameters can drastically affect a network's robustness. Furthermore, we observed that robust training is antithetical to large batch size training, in the sense that it favors areas with a noticeably smaller Hessian spectrum w.r.t. the weights.

The results also show that the robustness of the model does not (at least directly) correlate with the Hessian w.r.t. the input. We also proved that this Hessian is a PSD matrix, meaning that the problem of finding the adversarial perturbation is a saddle-free problem almost everywhere for models that satisfy Assumption 1. Furthermore, we showed that even though the model may converge to an area with positive curvature when considering the entire training dataset (i.e., the total loss), if we look at individual samples then the Hessian can actually have significant negative eigenvalues. From an optimization viewpoint, this is because SGD optimizes the expected loss and not the individual per-sample losses.

References

Appendix A Appendix

A.1 Proof of Theorem 1

In this section, we give the proof of Theorem 1. The first thing we want to point out is that, although we prove that the Hessians of these NNs w.r.t. the input are positive semi-definite almost everywhere, these NNs are not convex w.r.t. the input. The discontinuity of the ReLU derivative is the cause. (For instance, a combination of two step functions in 1-D is not a convex function, yet its second derivative exists, and equals zero, almost everywhere.) However, this has an important implication: the problem is saddle-free.

Before proceeding to the proof of Theorem 1, let us first prove the following lemma for the cross-entropy loss with a soft-max layer.

Lemma 3.

Let $z$ denote the input of the soft-max function, $y$ the (one-hot) correct label of the input $x$, $p = \mathrm{softmax}(z)$ the soft-max output, and $\ell(z) = -\sum_i y_i \log p_i$ the cross-entropy loss. Then $\nabla_z^2 \ell = \mathrm{diag}(p) - p p^T \succeq 0$.

Proof.

Let $p_i = e^{z_i} / \sum_j e^{z_j}$; then it follows that
$$\frac{\partial \ell}{\partial z_i} = p_i - y_i.$$
Further, it is not hard to see that
$$\frac{\partial p_i}{\partial z_j} = p_i \delta_{ij} - p_i p_j.$$
Then, the second-order derivative of $\ell$ w.r.t. $z$ is
$$\nabla_z^2 \ell = \mathrm{diag}(p) - p p^T.$$
Since, for any vector $v$,
$$v^T \left(\mathrm{diag}(p) - p p^T\right) v = \sum_i p_i v_i^2 - \Big(\sum_i p_i v_i\Big)^2 \ge 0$$
(by the Cauchy-Schwarz inequality, using $\sum_i p_i = 1$), we have $\nabla_z^2 \ell \succeq 0$.

Now, let us give the proof of Theorem 1.

Assume the input of the soft-max layer is $z(x)$ and the cross-entropy loss is $\mathcal{L}(\theta, x, y) = \ell(z(x))$. By the chain rule, it follows that
$$\nabla_x \mathcal{L} = \left(\frac{\partial z}{\partial x}\right)^T \nabla_z \ell.$$
From Assumption 1 we know that all the layers before the soft-max are either linear or ReLU, which implies that $\frac{\partial^2 z}{\partial x^2} = 0$ (a tensor) almost everywhere. Therefore, applying the chain rule again to the above equation,
$$\nabla_x^2 \mathcal{L} = \left(\frac{\partial z}{\partial x}\right)^T \nabla_z^2 \ell \,\frac{\partial z}{\partial x} + \sum_k \frac{\partial \ell}{\partial z_k} \nabla_x^2 z_k = \left(\frac{\partial z}{\partial x}\right)^T \nabla_z^2 \ell \,\frac{\partial z}{\partial x} \quad \text{almost everywhere}.$$
It is easy to see that $\nabla_x^2 \mathcal{L} \succeq 0$ almost everywhere, since $\nabla_z^2 \ell \succeq 0$ from Lemma 3.

From the above we can also see that the Hessian of the NN w.r.t. $x$ is at most a rank-$C$ matrix ($C$ being the number of classes), since the rank of the Hessian is dominated by the term $\nabla_z^2 \ell = \mathrm{diag}(p) - p p^T$, which is at most rank $C$.
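
The PSD and low-rank structure derived above can be sanity-checked numerically. The following is a small, hypothetical PyTorch sketch (network sizes and tolerances are illustrative) that builds a tiny ReLU classifier, forms the full input Hessian of the cross-entropy loss by double back-propagation, and checks that its eigenvalues are non-negative up to round-off and that its rank does not exceed the number of classes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny ReLU classifier: 20-dimensional input, 10 classes (illustrative sizes).
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(1, 20, requires_grad=True)
y = torch.tensor([3])
loss = nn.functional.cross_entropy(model(x), y)

# Full input Hessian via double back-propagation (feasible only for small inputs).
grad_x, = torch.autograd.grad(loss, x, create_graph=True)
rows = []
for i in range(x.numel()):
    row, = torch.autograd.grad(grad_x.flatten()[i], x, retain_graph=True)
    rows.append(row.flatten())
H = torch.stack(rows)

eigvals = torch.linalg.eigvalsh(H)
print("min eigenvalue:", eigvals.min().item())      # non-negative up to round-off
print("rank:", torch.linalg.matrix_rank(H).item())  # at most the number of classes
```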

A.2 Attacks Mentioned in the Paper

In this section, we give the details of the attacks used in our paper; please see Table 5.

FGSM
FGSM-10 (iterate 10 times)
GRAD
FHSM
HESS
Table 5: The definition of all attacks used in the paper.

A.3 Models Mentioned in the Paper

In this section, we give the details of the NNs used in our paper. For clarity, we omit the ReLU activations here; in practice, ReLU is applied throughout. Also, for all convolution layers, we add padding so that there is no dimension reduction. We denote by Conv(a,a,b) a convolution layer with b channels and a-by-a filters, by MP(a,a) an a-by-a max-pooling layer, by FN(a) a fully connected layer with a outputs, and by SM(a) the soft-max layer with a outputs. For our Conv(5,5,b) (Conv(3,3,b)) layers, the stride is 2 (1). See Table 6 for the details of all models used in this paper.

Name Structure
C1 (for CIFAR-10)
Conv(5,5,64) – MP(3,3) – BN–Conv(5,5,64)
–MP(3,3)–BN–FN(384)–FN(192)–SM(10)
C2 (for CIFAR-10)
Conv(3,3,63)–BN–Conv(3,3,64)–BN–Conv(3,3,128)
–BN–Conv(3,3,128)–BN–FC(256)–FC(256)–SM(10)
C3 (for CIFAR-10)
Conv(3,3,64)–Conv(3,3,64)–Conv(3,3,128)
–Conv(3,3,128)–FC(256)–FC(256)–SM(10)
M1 (for MNIST) Conv(5,5,20)–Conv(5,5,50)–FC(500)–SM(10)
CR (for CIFAR-100) ResNet18 For CIFAR-100
Table 6: The definition of all models used in the paper.
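
As a concrete reference, the following is a hedged PyTorch rendering of the M1 network from Table 6. The padding is chosen so that the 5x5 stride-2 convolutions halve the spatial size (28 -> 14 -> 7), consistent with the description above; the exact padding and initialization of the original implementation may differ.

```python
import torch.nn as nn

# M1 (for MNIST): Conv(5,5,20) -- Conv(5,5,50) -- FC(500) -- SM(10)
m1 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(20, 50, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(50 * 7 * 7, 500), nn.ReLU(),
    nn.Linear(500, 10),  # soft-max is applied inside the cross-entropy loss
)
```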

A.4 Discussion on the Second-Order Method

Although the second-order adversarial attack works well for MNIST (see Table 3), for most of our experiments on CIFAR-10 (see Table 4) the second-order methods are weaker than variations of the gradient-based methods. Also, notice that on CIFAR-10 the robust models trained with the second-order method are themselves more prone to attack. We give two potential explanations here.

First, note that the Hessian w.r.t. the input is a low-rank matrix. In fact, as mentioned above, the rank of the input Hessian for CIFAR-10 is at most ten (see Proposition 2), whereas the matrix itself is 3072 x 3072. Even though we use an inexact Newton method [11] along with a Conjugate Gradient solver, this low-rank structure creates numerical problems. Designing preconditioners for the second-order attack is part of our future work. The second point is that, as we saw in the previous section, the input Hessian does not directly correlate with how robust the network is. In fact, the most effective attack would be to perturb the input towards the decision boundary, instead of just maximizing the loss.

A.5 More Numerical Results for §2 and §3

In this section, we provide more numerical results for §2 and §3. All conclusions from these numerical results are consistent with those in §2 and §3.

Batch Acc Acc Acc
64 100  (99.21) 0.49  (2.96 ) 0.07  (0.41) 0.007  (0.10) 0.53  (0.53) 0.85  (0.85)
128 100  (99.18) 1.44  (8.10 ) 0.10  (0.51) 0.009  (0.12) 0.50  (0.51) 0.83  (0.83)
256 100  (99.04) 2.71  (13.54) 0.09  (0.50) 0.008  (0.12) 0.45  (0.46) 0.81  (0.82)
512 100  (99.04) 5.84  (26.35) 0.12  (0.52) 0.010  (0.13) 0.42  (0.42) 0.79  (0.80)
1024 100  (99.05) 21.24  (36.96) 0.25  (0.42) 0.032  (0.11) 0.32  (0.33) 0.73  (0.74)
2048 100  (98.99) 44.30  (49.36) 0.36  (0.39) 0.075  (0.11) 0.19  (0.19) 0.72  (0.73)
Table 7: Results on the MNIST dataset for the M1 model (LeNet-5). We show the Hessian spectrum of models trained with different batch sizes, and the corresponding performance on adversarial datasets generated from the training/testing dataset (testing results are shown in parentheses). We report the adversarial accuracy for different magnitudes of attack. The interesting observation is that the dominant eigenvalue increases while the adversarial accuracy decreases for a fixed perturbation magnitude. Meanwhile, we do not know whether there is a relationship between the spectrum and the clean accuracy, nor do we see a clear relation between the input gradient/Hessian metrics and the adversarial accuracy.
Figure 7: The landscape of the loss functional is shown along the dominant eigenvector of the Hessian on MNIST for M1. Note that the y-axis is in logarithmic scale. The x-axis is a scalar that perturbs the model parameters along the dominant eigenvector. The green line is the loss for a randomly chosen batch of size 320 on MNIST. The blue and red lines are the training and test loss, respectively. From the figure we can see that the curvature of the test loss is much larger than that of the training loss.
Figure 8: We show the landscape of the test and training objective functional along the first eigenvector of the sub-sampled Hessian, computed from 320 samples of the training dataset, on MNIST for M1. We plot both the batch loss and the total training and test loss. One can see visually that the robust models converge to a region with smaller curvature.
Figure 9: We show the landscape of the test and training objective functional along the first eigenvector of the sub-sampled Hessian, computed from 320 training samples, on CIFAR-10 for C3. We plot both the batch loss and the total training and test loss. One can see visually that the curvature of the robust models is smaller.
Figure 10: 1-D parametric plot on MNIST for M1 of the original and adversarial models, showing how the landscape of the total loss functional changes when we interpolate from the original model to the robust model. For all cases the robust model ends up at a point that has relatively smaller curvature compared to the original network.
Figure 11: 1-D parametric plot on CIFAR-10 of the original and adversarial models, i.e., how the total loss functional changes when interpolating from the original model to the robust model. For all cases the robust model ends up at a point that has relatively smaller curvature compared to the original network.
Figure 12: Spectrum of the sub-sampled Hessian of the loss functional w.r.t. the model parameters, computed by power iteration on MNIST for M1. The results are computed for several different sub-sample sizes. We report two cases for the single batch experiment, which is drawn randomly from the clean training data. The results show that the sub-sampled Hessian spectrum decreases for robust models. An interesting observation is that for the MNIST dataset the original model has actually converged to a saddle point, even though it has a good generalization error. Also notice that the results for the two largest sub-sample sizes are relatively close, which hints that the curvature of the full Hessian should also be smaller for the robust methods. This is demonstrated in Fig. 8.
Figure 13: The landscape of the loss functional is shown when the C2 model parameters are changed along the first two dominant eigenvectors of the Hessian. The two axes are scalars that perturb the model parameters along the first and second dominant eigenvectors.
Figure 14: The landscape of the loss functional is shown when the M1 model parameters are changed along the first two dominant eigenvectors of the Hessian. The two axes are scalars that perturb the model parameters along the first and second dominant eigenvectors.
Figure 15: The landscape of the loss functional is shown along the dominant eigenvector of the Hessian for the C2 architecture on CIFAR-10. The x-axis is a scalar that perturbs the model parameters along the dominant eigenvector.
Figure 16: Changes in the dominant eigenvalue of the Hessian w.r.t. the weights and in the total gradient are shown over epochs during training. Note the increase in the dominant eigenvalue (blue curve) for large batch vs. small batch. In particular, note that the values of the total gradient along with the Hessian spectrum show that large batch training does not get "stuck" in saddle points, but rather in areas of the optimization landscape with high curvature. The dotted points show the corresponding results when we use robust optimization. We can see that this pushes the training to flatter areas, which clearly demonstrates the potential of robust optimization as a means to avoid sharp minima.
Batch Baseline Acc FB Acc Robust Acc
8000 0.7559 0.752 0.7612
10000 0.7561 0.1 0.7597
25000 0.7023 0.1 0.7409
50000 0.5523 0.1 0.7116
Table 8: Baseline accuracy is shown for large batch sizes for the C1 model, along with the results achieved with the learning rate scaling method proposed by [17] (denoted by "FB Acc"). The last column shows results when training is performed with robust optimization. As we can see, the performance of the latter is actually better for large batch sizes. We emphasize that the goal is to perform analysis to better understand the problems with large batch size training; more extensive tests are needed before one could claim that robust optimization performs better than other methods.