In machine learning it is important that learned models are robust to out-of-distribution data as well as to data from the distribution the model is trained on. One way to achieve this robustness is data augmentation, of which adversarial training is a special case. While there are studies relating the various hyperparameters, e.g., for good accuracy, less attention has been paid to tuning hyperparameters for simultaneous natural accuracy and adversarial robustness. In this work we initiate such a study by training models with a fixed learning rate to batch size ratio, as proposed by Jastrzębski et al. We evaluate the natural accuracy and adversarial robustness of models trained in this manner using only unperturbed inputs during the training phase. We also study the Hessian spectrum of models trained in this fashion, as done by Yao et al., and report our observations.
Stochastic Gradient Descent (SGD) and its variants are the current workhorse for training neural network models. Hyperparameters like learning rate, batch size and momentum play an important role in SGD converging to models that generalize well. Prior work studies some relations between these hyperparameters and suggests rules for better training accuracy. Goyal et al. give a rule that adjusts the learning rate as a function of minibatch size. This results in a significant speedup in the time it takes to train networks on ImageNet in a distributed setting.
It has been observed that there is a trade-off between the generalization of a model and the batch size used in training when the batch size is larger than 512. Keskar et al. show that large batch training results in neural networks settling at sharp minima and not being able to generalize well. They also propose a solution to handle the sharp minima issue along with better generalization. On the other hand, Dinh et al. report that sharp minima on deeper networks can also generalize well. Jastrzębski et al. show that maintaining a constant ratio between learning rate and batch size during training results in networks converging to flatter minima, leading to better generalization.
Recent work has exposed a serious vulnerability in neural network-based models that achieve state-of-the-art results on tasks such as object recognition. These models are known to be vulnerable to small, pixel-wise perturbations of the inputs. While these changes are almost imperceptible to the human eye, neural networks grossly misclassify such perturbed data, even when they classify the unperturbed data correctly. Szegedy et al. obtain these small perturbations using box-constrained L-BFGS, by maximizing the prediction error of the given model. Goodfellow et al. propose a quicker method based on gradients, the Fast Gradient Sign Method (FGSM). An FGSM adversarial perturbation of an input $x$ is given by $x' = x + \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y))$. Here $y$ is the target, $\theta$ are the model parameters, and $J$ is the loss function used to train the network. Subsequent work has introduced multi-step variants of FGSM, notably the Basic Iterative Method (BIM) by Kurakin et al. and Projected Gradient Descent (PGD) by Madry et al. On visual tasks, the adversarial perturbation must come from a set of images that are perceptually similar to a given image. The FGSM and PGD papers study adversarial perturbations from an $\ell_\infty$-ball of radius $\epsilon$ around the input $x$, namely, each pixel value is perturbed by a quantity within $[-\epsilon, \epsilon]$.
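To make the FGSM step concrete, here is a minimal NumPy sketch on a toy logistic-regression classifier. The model, its weights, the input and the value of `eps` are all hypothetical choices made for illustration; they are not the networks or settings used in this paper. The input gradient is written out analytically so that the example needs no deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, x, y):
    # Binary cross-entropy loss of a logistic model on one example.
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def input_gradient(w, b, x, y):
    # Gradient of the loss with respect to the input x (not the weights).
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w

def fgsm(w, b, x, y, eps):
    # One signed-gradient step of size eps, then clip to the valid pixel range.
    x_adv = x + eps * np.sign(input_gradient(w, b, x, y))
    return np.clip(x_adv, 0.0, 1.0)

# Hypothetical toy model and input (assumptions for illustration only).
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.5, 0.2, 0.7])
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.1)
loss_clean = cross_entropy(w, b, x, y)
loss_adv = cross_entropy(w, b, x_adv, y)
```

Every coordinate of `x_adv` moves by at most `eps`, and the loss on the perturbed input is larger than on the clean one, which is the behaviour the attack is designed to produce.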
A simple method to obtain an adversarially robust network is to include perturbed samples in the training process. This is referred to as adversarial training. It is a special type of data augmentation in which the data is augmented with adversarial perturbations, e.g. PGD, while training. This method is expensive, but currently provides some of the most adversarially robust models. Some have tried a mixed approach. For example, one such work initially trains the network for 100 epochs on unperturbed samples to achieve good accuracy and then performs adversarial training with FGSM-perturbed samples for 5-10 epochs. While that work restricts its study to FGSM based training and FGSM based test accuracy, we also include PGD based test results in the Appendix for all the plots in our study, as the PGD based attack is currently believed to be the strongest $\ell_\infty$-based adversarial attack.
A given network can be made robust either by explicit regularization, such as adversarial training or weight conditioning, or by implicit regularization through hyperparameter tuning. Our work can be viewed as a study of the influence of hyperparameters like learning rate, batch size and momentum on the adversarial robustness of networks trained with natural samples. Obtaining a naturally robust system without adversarial training is a desirable property. Sabour et al. address this issue at the architecture level, without adversarial training. Schmidt et al. study the sample complexity of adversarial generalization and claim that a much larger sample size would be needed to achieve adversarial robustness. Galloway et al. observe that SGD with weight decay yields a robust network with better generalization than what can be achieved by adversarial training. Our focus is on the effect of the hyperparameters of SGD on adversarial robustness, without weight decay.
Yao et al. compute the Hessian spectrum of the parameter space of networks trained with large batch size and consistently observe that the top eigenvalue of the Hessian (the principal curvature) has a large magnitude. They empirically observe a direct correlation between the adversarial robustness of networks and the principal curvature. They conclude that since large batch training leads to convergence to points with high principal curvature, the resulting models are not adversarially robust.
Our paper tries to understand the natural robustness of networks obtained by SGD hyperparameter tuning alone. Motivated by the works of Jastrzębski et al. and Yao et al., we seek to understand the relation between the network weights obtained under various hyperparameter settings of SGD and the resulting network's FGSM/PGD adversarial robustness. For this study we use the MNIST, Fashion MNIST and CIFAR-10 datasets. To compare with existing literature we use the M1 and C1 models from Yao et al. The StdCNN model we use is described in Table 1 and the ResNet18 model we use is from He et al.
We make the following empirical observations.
- Training models with a constant learning rate to batch size ratio not only leads to convergence to flatter minima but also ensures that adversarial robustness does not degrade with increasing batch size during training.
- We show that the Hessian based analysis of Yao et al. does not always explain adversarial robustness in small vs large batch training.
- We show that there are models which have a higher Hessian spectrum when trained with a large batch size, yet have better adversarial accuracy than models trained with a small batch size.
- Adding momentum does help in converging to flatter minima; this is empirically substantiated by a lowering of the Hessian spectrum. Training with a larger momentum value leads (in most cases) to a more robust model than training with a smaller momentum value.
1.1 SGD, its variants and hyperparameters
The simplest gradient based training algorithm is batch gradient descent. Since all the data does not fit into memory, running vanilla batch gradient descent becomes expensive, because at each step data needs to be brought into memory for the gradient calculation. So, in practice, the algorithm used to train a neural network model, that is, to learn its free parameters by optimizing the loss function, is a version of gradient descent known as Stochastic Gradient Descent (SGD). SGD and its variants have been shown to converge to good local minima of the (non-convex) loss function that generalize well. Another version of SGD used in practice is Mini-Batch Gradient Descent with Momentum (Algorithm 1). In this formulation we can see an interplay between three hyperparameters of SGD: the learning rate, the batch size (b) and the momentum. When the momentum is zero, the formulation reduces to Mini-Batch Gradient Descent. When the momentum is zero and b=1, it becomes Stochastic Gradient Descent. When the momentum is zero and b=N, where N is the number of training samples, we get vanilla Batch Gradient Descent. To maintain a balance, the mini-batch gradient method is used in practice.
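The interplay between learning rate, batch size and momentum can be sketched on a toy least-squares problem. The code below is a minimal illustration that assumes the common heavy-ball form of the momentum update (velocity = momentum * velocity + gradient); the exact form used in Algorithm 1 may differ. The function name, the synthetic data and all numeric values are hypothetical.

```python
import numpy as np

def minibatch_sgd_momentum(X, y, lr, batch_size, momentum, epochs, seed=0):
    """Mini-batch gradient descent with heavy-ball momentum on a
    least-squares loss 0.5 * mean((X @ theta - y) ** 2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residual / len(idx)
            velocity = momentum * velocity + grad
            theta = theta - lr * velocity
    return theta

# Synthetic, noise-free regression data (an assumption for illustration).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

# Mini-batch gradient descent with momentum.
theta_momentum = minibatch_sgd_momentum(X, y, lr=0.1, batch_size=32, momentum=0.5, epochs=200)
# Momentum 0 and b = 1: plain SGD.
theta_sgd = minibatch_sgd_momentum(X, y, lr=0.1, batch_size=1, momentum=0.0, epochs=200)
# Momentum 0 and b = N: vanilla batch gradient descent.
theta_batch = minibatch_sgd_momentum(X, y, lr=0.1, batch_size=256, momentum=0.0, epochs=200)
```

The three calls differ only in `batch_size` and `momentum`, which mirrors how the special cases described above fall out of one update rule.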
From the variants of SGD, we see that the learning rate is an important hyperparameter. It has been observed that tuning the learning rate itself can aid in better convergence of the SGD algorithm to a minimum that generalizes well. The common rule of thumb in tuning the learning rate is to decay it as we train the network. Smith et al. note that instead of the usual practice of decreasing the learning rate over the epochs, increasing the batch size also leads to a network with similar test accuracy. This has the advantage of faster training with fewer parameter updates. They also run experiments with an inverse relation between the batch size and momentum and show that this too leads to better training with a small drop in test accuracy.
So batch size is an important hyperparameter in the training of neural networks. The larger the batch size, the faster the training of the network. But the available computational resources restrict the maximum batch size for training a given network architecture. Bengio and Bottou describe some common rules and tricks used in practice for optimizing the parameters of neural network models with SGD. They show that small batch sizes are generally preferred. Hoffer et al. note that training with a larger batch size speeds up training but suffers from a severe generalization problem. They suggest training for a longer time to improve the generalization of the network.
More recently, Jastrzębski et al. show that maintaining a constant learning rate to batch size ratio helps the SGD algorithm converge to flat minima, which generalize well to the test points. Yao et al. characterise the adversarial (FGSM) vulnerability of networks obtained with large batch training in terms of the top eigenvalue of the Hessian. They notice a correlation between the adversarial robustness (FGSM) of the network and the top eigenvalue of the Hessian. The caveat is that they do the study under a particular hyperparameter setting: they train the network using a varying learning rate to batch size ratio (refer to Section 2.1 for details of their hyperparameter setting). Therefore, we believe that a study of the role of hyperparameters like learning rate and batch size, and of the corresponding adversarial robustness, is important. We empirically analyse the applicability of this theory to the understanding of the adversarial robustness of the network. We do this by systematically training the network with SGD using various hyperparameter settings, and checking the adversarial robustness of the resulting network. For each of these hyperparameter settings, we also study the role that momentum plays.
An important point to be noted in this paper is that at no point do we use adversarially perturbed samples in the training process. All the networks are trained with samples without adversarial perturbation.
2 Hyperparameter settings, model architectures and data sets used.
To get an understanding of the natural robustness of models, we trained models under various hyperparameter settings. For each of these settings we plotted the accuracy of the trained model and its adversarial accuracy on adversarially perturbed test inputs. We emphasize again that we do not augment the training data with adversarially perturbed training inputs. We performed our experiments on the models used by Yao et al. and also on StdCNN (refer to Table 1) and ResNet18. Yao et al. only use FGSM based adversarial perturbations in their experiments for both training and testing. We provide results for both FGSM and PGD based test perturbations. In the next section we describe the hyperparameter settings used in our experiments and the models and data sets used.
2.1 Details of Datasets and Model Parameters
The following hyperparameter settings were used during training. For each of these settings we performed our experiments with momentum set to 0.0, 0.2, 0.5 and 0.9. In all our plots we clearly mention which hyperparameter settings have been used to obtain that plot. In Section 3, we discuss our observations with momentum set to zero and in Section 4, we discuss our observations with non-zero momentum.
(1) LR (see Smith et al.). Here the learning rate is fixed to 0.01, the batch size is varied, and training is done with this setting for 100 epochs. Results for this hyperparameter setting are shown in light blue in all our plots.
(2) LR/BS (see Jastrzębski et al.). Here the learning rate to batch size ratio is kept constant. We set the ratio to 0.00015625 and train with this fixed ratio for 100 epochs with varying batch sizes. Results for this hyperparameter setting are shown in purple in all our plots.
(3) Benchmark. For comparison with Yao et al., we also train models using exactly the settings from their paper. Here the learning rate is set to 0.01 and momentum to 0.9, and the learning rate is halved after every 5 epochs, for a total of 100 epochs. Results for this hyperparameter setting are shown in red in all our plots, and we refer to this as Benchmark.
Note that in settings (1) and (2) above we use neither weight decay nor learning rate decay. We fix the learning rate, batch size and momentum at the beginning of training with SGD, and these settings are not adapted during training.
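The arithmetic behind the LR/BS and Benchmark settings can be written out as a short sketch. The constants 0.00015625, 0.01, 0.5 and 5 come from the text above; the function names, the list of batch sizes, and the convention that epochs are counted from zero are assumptions made for illustration.

```python
def lr_for_constant_ratio(batch_size, ratio=0.00015625):
    # LR/BS setting: the learning rate scales linearly with the batch size.
    return ratio * batch_size

def benchmark_lr(epoch, base_lr=0.01, decay_every=5, factor=0.5):
    # Benchmark setting: halve the learning rate after every 5 epochs.
    # Epochs are assumed to be numbered from 0.
    return base_lr * factor ** (epoch // decay_every)

# Hypothetical batch sizes; the source does not list the values it sweeps.
batch_sizes = [64, 128, 256, 512, 1024]
ratio_lrs = {b: lr_for_constant_ratio(b) for b in batch_sizes}

# Learning rate of the Benchmark setting at the start of a few epochs.
benchmark_lrs = [benchmark_lr(e) for e in (0, 4, 5, 10)]
```

With this ratio, a batch size of 64 corresponds to a learning rate of 0.01, the same value that the LR setting keeps fixed; every doubling of the batch size doubles the learning rate.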
For each of the hyperparameter settings and momentum values in the experiments above, we computed the largest eigenvalue of the Hessian with respect to the model parameters for varying batch sizes, and plotted the results.
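One common way to estimate the largest Hessian eigenvalue without forming the Hessian is power iteration on Hessian-vector products. The sketch below is an illustration of that general technique and not necessarily the procedure used for the plots in this paper. It approximates the Hessian-vector product by a central finite difference of gradients and is checked on a toy quadratic loss whose Hessian is known; all names and values are hypothetical.

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    # Central finite difference of the gradient along direction v.
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def top_hessian_eigenvalue(grad_fn, theta, num_iters=100, seed=0):
    # Power iteration: converges to the eigenvalue of largest magnitude.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, theta, v)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue

# Toy quadratic loss 0.5 * theta^T A theta, whose Hessian is exactly A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

def quadratic_grad(theta):
    return A @ theta

theta0 = np.ones(2)
estimate = top_hessian_eigenvalue(quadratic_grad, theta0)
exact = float(np.linalg.eigvalsh(A)[-1])
```

For a neural network, `grad_fn` would be replaced by the gradient of the training loss with respect to the flattened parameters, typically obtained through automatic differentiation rather than finite differences.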
Data sets. The MNIST dataset consists of grayscale images of size $28 \times 28$, divided into 10 classes, and is split into training, validation and test sets. The Fashion MNIST dataset likewise consists of grayscale images of size $28 \times 28$, divided into 10 classes, and is split into training, validation and test sets. The CIFAR-10 dataset consists of colour images of size $32 \times 32$, divided into 10 classes, and is split into training, validation and test sets.
Model architectures. For the MNIST and Fashion MNIST based experiments we use the architectures M1 and StdCNN given in Table 1.
For the CIFAR-10 based experiments we use the model C1 given in Table 1 and the ResNet18 architecture of He et al. Input training data was augmented with random cropping and random horizontal flips by default.
The architectures M1, used for the MNIST and Fashion MNIST experiments, and C1, used for the CIFAR-10 experiments, are those of Yao et al., which form the benchmark for comparison.
All the PGD based attack results in the Appendix, which correspond to the FGSM attack based plots in the paper, were obtained with k = 40 PGD iterations.
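For reference, a PGD attack within an $\ell_\infty$-ball can be sketched as repeated signed-gradient steps, each followed by a projection back onto the ball. The sketch below uses a toy logistic-regression model so that it runs with NumPy alone. It treats k = 40 as the number of iterations, which is an assumption, and the per-step size, the radius `eps`, the model and the input are hypothetical values chosen for illustration. The random starting point used in some PGD variants is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, x, y):
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def input_gradient(w, b, x, y):
    # Gradient of the loss with respect to the input x.
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w

def pgd_attack(w, b, x, y, eps, step_size, num_steps=40):
    x_adv = x.copy()
    for _ in range(num_steps):
        # Ascend the loss with a signed-gradient step.
        x_adv = x_adv + step_size * np.sign(input_gradient(w, b, x_adv, y))
        # Project back onto the L-infinity ball of radius eps around x.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # Keep pixel values in the valid range.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Hypothetical toy model and input (assumptions for illustration only).
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.5, 0.2, 0.7])
y = 1.0

x_pgd = pgd_attack(w, b, x, y, eps=0.1, step_size=0.01, num_steps=40)
loss_clean = cross_entropy(w, b, x, y)
loss_pgd = cross_entropy(w, b, x_pgd, y)
```

On a linear model like this one, PGD ends at the same corner of the ball that a single FGSM step would reach; the two attacks differ on non-linear networks, where the gradient direction changes between iterations.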
Table 1: Architectures of the StdCNN, M1 and C1 models.
|Model|Architecture|
|---|---|
|StdCNN|Conv(3,3,10) - Conv(3,3,10) - MP(2,2) - Conv(3,3,20) - Conv(3,3,20) - MP(2,2) - FC(50) - Dropout(0.5) - FC(10) - SM(10)|
|M1|Conv(5,5,20) - Conv(5,5,20) - FC(500) - SM(10)|
|C1|Conv(5,5,64) - MP(3,3) - BN - Conv(5,5,64) - MP(3,3) - BN - FC(384) - FC(192) - SM(10)|
3 Adversarial robustness using hyperparameter tuning in natural training
In this section we report the results of our experiments when momentum is set to zero. For the benchmark alone we continue to use momentum 0.9.
We first verify that we obtain the values and trends reported by Yao et al. Our observations confirm their findings. From the benchmark plots in Figures 1(left) and 4(left) it is clear that as the batch size increases the test accuracy decreases, and the associated FGSM test accuracy also drops. We were able to replicate the benchmark experiments on test accuracy. For adversarial robustness using FGSM we observe the same trend reported by Yao et al., namely that the accuracy drops with larger batch size; however, our accuracy values are different. It is clear from the top eigenvalue plots that as batch size increases the principal curvature increases. This can be seen in the benchmark plots in Figure 1(right) and Figure 4(right).
In Figure 1(left) we plot the test accuracy as a function of batch size, and the adversarial accuracy as a function of batch size using the FGSM attack with a fixed perturbation size $\epsilon$, for model M1 on the MNIST dataset. We plot this with LR set to a constant (light blue line) and with LR/BS set to a constant (purple line). For each setting of the hyperparameters, we computed the top eigenvalue as a function of batch size. This is plotted in Figure 1(right).
We also consider the PGD attack on these models trained to classify MNIST. Figure 15 (in Appendix 0.A) shows the plots of accuracy and adversarial accuracy versus batch size for the PGD attack on models M1 and StdCNN respectively, under various settings of hyperparameters. Figure 17 (in Appendix 0.A) shows the PGD attack plots for models C1 and ResNet18 classifying CIFAR-10.
For models M1 and StdCNN we also performed the above experiments when they were trained to classify Fashion MNIST. Figures 3(left) and 6(left) plot the generalization as a function of batch size, and the FGSM accuracy as a function of batch size, for Fashion MNIST. In Figures 3(right) and 6(right) we plot the top eigenvalue of the Hessian as a function of batch size.
Our observations confirm what Yao et al. observe, namely that as batch size increases natural accuracy drops. We also confirm their observation that as batch size increases the principal curvature increases. Our experiments also confirm the observation of Jastrzębski et al. that training with a constant LR/BS ratio leads to flatter minima. This can be observed from the purple lines in the top eigenvalue plots in Figure 1(right), Figure 2(right), Figure 4(right) and Figure 5(right), which are almost flat.
One major point to note here is that the purple lines, whether for generalization, FGSM/PGD accuracy or principal curvature of the parameter space (top Hessian eigenvalue), show very little variation across all models. So with constant LR/BS ratio training, neither the natural accuracy nor the adversarial accuracy suffers with an increase in batch size during training. This is not true for the FGSM/PGD accuracy of models trained with constant LR, nor is it true for the benchmark, where one sees a drop in accuracy with an increase in batch size during training.
3.2 Counter Example to the Hessian based analysis of Yao et al.
We give an example of a model which, when trained with a large batch size, has both a higher Hessian spectrum and higher adversarial robustness compared to small batch training. For this experiment we use the network given in Table 1 trained on Fashion MNIST. We train the network with a fixed learning rate. We observe in Figure 6(left) that the FGSM accuracy increases with batch size (the light blue line in the plots). As expected, the curvature also increases. This can be seen in Figure 6(right). On this model we see the same behaviour with respect to the PGD attack. This can be seen from Figure 18(right) (in Appendix 0.A.1), where the PGD accuracy also increases with increasing batch size. This indicates that an increase in the Hessian spectrum alone cannot explain the change in adversarial robustness of neural networks.
4 The effect of momentum on adversarial robustness
We now analyse the role of momentum in the two hyperparameter settings, LR and LR/BS, that we used in Section 3 to compare with Benchmark. For each of these settings, we train each model with momentum set to 0.0, 0.2, 0.5 and 0.9, and plot the natural accuracy on the test data as a function of batch size and the adversarial accuracy as a function of batch size when the test inputs are adversarially perturbed using FGSM and PGD (in Appendix 0.B). We also plot the top eigenvalues of the Hessian as a function of batch size in each of these settings.
It is clear from all the Hessian eigenvalue plots in Figures 7, 8, 11, 12 for MNIST and Figures 9, 10, 13, 14 for CIFAR-10 that in both setups, LR and LR/BS, accuracy improves with increased momentum. Furthermore, increasing momentum leads to convergence to points with a lower Hessian spectrum.
4.1 Effect of Momentum with LR
For a finer analysis of the impact of momentum with fixed LR, we plot the natural accuracy and adversarial accuracy with momentum values set to 0.0, 0.2, 0.5 and 0.9 for the models M1 and StdCNN on MNIST and the models C1 and ResNet18 on CIFAR-10. Figures 7(left), 8(left), 9(left) and 10(left) show the generalization trend and, as expected, larger momentum gives better generalization. This in turn leads to better FGSM adversarial robustness, as seen in Figures 7(left) and 8(left) for MNIST and Figures 9(left) and 10(left) for CIFAR-10. Similarly, Figures 7(right), 8(right), 9(right) and 10(right) show that the curvature reduces with larger momentum. For the PGD robustness plots refer to Figure 19 for MNIST and Figure 20 for CIFAR-10, in Appendix 0.B.1.
4.2 Effect of Momentum with LR/BS
For a finer analysis of the impact of momentum with fixed LR/BS, we plot the natural accuracy and adversarial accuracy with momentum values set to 0 and 0.5 or 0.9 for the models M1 and StdCNN on MNIST and the models C1 and ResNet18 on CIFAR-10. Figures 11, 12, 13 and 14 show that in most cases the curvature reduces with larger momentum, but compared to fixed LR training, momentum plays no significant role when the learning rate to batch size ratio is constant. The mild change in curvature does not always translate into better generalization or adversarial robustness, as seen in Figures 11(left), 12(left), 13(left) and 14(left) for the FGSM attack and in Figures 21 and 22 (Appendix 0.B.2) for the PGD attack. In this section we use fewer momentum values because the plots make it evident that momentum in the LR/BS setting does not play a significant role in the training or in the resulting accuracy and adversarial robustness of the network.
We show how the modelling of SGD by Jastrzębski et al. and the Hessian spectrum can help understand the weight space and its adversarial properties. We also see how momentum plays a role in reducing the Hessian spectrum irrespective of the ratio maintained between learning rate and batch size. This paper tries to understand the role of hyperparameters in the resulting network's robustness when no perturbed inputs are used in training. Such an understanding would be necessary to gauge the impact of adversarial training on top of it, and could aid in adapting the hyperparameters for adversarial training. It also opens the possibility of networks achieving adversarial robustness through hyperparameter tuning alone, without adversarial training.
References
- Bengio (2012) Practical recommendations for gradient-based training of deep architectures. arXiv preprint arXiv:1206.5533.
- Biggio et al. (2017) Evasion attacks against machine learning at test time. arXiv preprint arXiv:1708.06131.
- Bottou (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade: Second Edition, pp. 421–436.
- Dinh et al. (2017) Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
- Galloway et al. (2018) Adversarial training versus weight decay. arXiv preprint arXiv:1804.03308.
- Goodfellow et al. (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- Goyal et al. (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- He et al. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hoffer et al. (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. CoRR, pp. 1731–1741.
- Jastrzębski et al. (2017) Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623.
- Keskar et al. (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
- Kurakin et al. (2017) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
- Madry et al. (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
- Sabour et al. (2017) Dynamic routing between capsules. arXiv preprint arXiv:1710.09829.
- Schmidt et al. (2018) Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285.
- Smith et al. (2017) Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.
- Szegedy et al. (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Yao et al. (2018) Hessian-based analysis of large batch training and robustness to adversaries. arXiv preprint arXiv:1802.08241.