Batch Normalization is a Cause of Adversarial Vulnerability

05/06/2019 ∙ by Angus Galloway, et al.

Batch normalization (batch norm) is often used in an attempt to stabilize and accelerate training in deep neural networks. In many cases it indeed decreases the number of parameter updates required to reduce the training error. However, it also reduces robustness to small input perturbations and noise by double-digit percentages, as we show on five standard datasets. Furthermore, substituting weight decay for batch norm is sufficient to nullify the relationship between adversarial vulnerability and the input dimension. Our work is consistent with a mean-field analysis that found that batch norm causes exploding gradients.


1 Introduction

Batch norm is a standard component of modern deep neural networks, and tends to make the training process less sensitive to the choice of hyperparameters in many cases [13]. While ease of training is desirable for model developers, an important concern among stakeholders is that of model robustness to plausible, previously unseen inputs during deployment. The adversarial examples phenomenon has exposed unstable predictions across state-of-the-art models [27]. This has led to a variety of methods that aim to improve robustness, but doing so effectively remains a challenge [1, 20, 11, 14]. We believe that a prerequisite to developing methods that increase robustness is an understanding of factors that reduce it.

Figure 1: Two mini-batches from the “Adversarial Spheres” dataset (2D), and their representations in a deep linear network with batch norm at initialization. Mini-batch membership is indicated by marker fill and class membership by colour. Each layer is projected to its first two principal components. Classes are mixed by Layer 14.

Approaches for improving robustness often take existing neural network architectures that use batch norm and patch them against specific attacks, e.g., through the inclusion of adversarial examples during training [27, 9, 15, 16]. An implicit assumption is that batch norm itself does not reduce robustness, an assumption that we tested empirically and found to be invalid. In the original work that introduced batch norm, it was suggested that other forms of regularization can be turned down or disabled when using it without decreasing standard test accuracy. Robustness, however, is less forgiving: it is strongly impacted by the disparate mechanisms of various regularizers. The frequently made observation that adversarial vulnerability can scale with the input dimension [9, 8, 24] highlights the importance of identifying regularizers as more than merely a way to improve test accuracy. In particular, batch norm was a confounding factor in [24], making the results of their initialization-time analysis hold after training. By adding regularization and removing batch norm, we show that there is no inherent relationship between adversarial vulnerability and the input dimension.

2 Batch Normalization

We briefly review how batch norm modifies the hidden layers' pre-activations of a neural network. We use the notation of [32], where $i$ is the index for a neuron, $l$ for the layer, and $\mathcal{B}$ for a mini-batch of $M$ samples from the dataset; $N^l$ denotes the number of neurons in layer $l$, $W^l$ is the matrix of weights and $b^l$ is the vector of biases that parametrize layer $l$. The batch mean is defined as $\mu_i = \frac{1}{M}\sum_m h_{im}$, and the variance is $\sigma_i^2 = \frac{1}{M}\sum_m (h_{im} - \mu_i)^2$, where $h_{im}$ denotes the pre-activation of neuron $i$ for example $m$ of the mini-batch. In the batch norm procedure, the mean is subtracted from the pre-activation of each unit (consistent with [13]), the result is divided by the standard deviation plus a small constant $c$ to prevent division by zero, then scaled and shifted by the learned parameters $\gamma_i$ and $\beta_i$, respectively. This is described in Eq. (1), where a per-unit nonlinearity $\phi$, e.g., ReLU, is applied after the normalization:

$$\tilde{h}_{im} = \phi\left(\gamma_i \, \frac{h_{im} - \mu_i}{\sqrt{\sigma_i^2 + c}} + \beta_i\right). \qquad (1)$$

Note that this procedure fixes the first and second moments of all neurons $i$ equally at initialization, independent of the width or depth of the network. This suppresses the information contained in these moments. Because batch norm induces a non-local batch-wise nonlinearity at each unit $i$, this loss of information cannot be recovered by the parameters $\gamma_i$ and $\beta_i$. Furthermore, it has been widely observed empirically that these parameters do not influence the effect being studied [31, 33, 32]. Thus, $\gamma$ and $\beta$ can be incorporated into the per-unit nonlinearity without loss of generality. To understand how batch normalization is harmful, consider two mini-batches that differ by only a single example: due to the induced batch-wise nonlinearity, they will have different representations for each example [32]. This difference is further amplified by stacking batch norm layers. Conversely, normalization of the intermediate representations of two different inputs impairs the ability of batch-normalized networks to distinguish high-quality examples (as judged by an "oracle") that ought to be classified with a large prediction margin from low-quality, i.e., more ambiguous, instances. The last layer of a discriminative neural network, in particular, is typically a linear decoding of class label-homogeneous clusters, and thus makes extensive use of information represented via differences in mean and variance at this stage for the purpose of classification. We argue that this information loss and inability to maintain relative distances in the input space reduce adversarial as well as general robustness. Figure 1 shows a degradation of class-relevant input distances in a batch-normalized linear network on a 2D variant of the "Adversarial Spheres" dataset [8].¹ Conversely, class membership is preserved in arbitrarily deep unnormalized networks (see Figure 7 of Appendix C), though we require a scaling factor to increase the magnitude of the activations to see this visually.

¹We add a ReLU nonlinearity when attempting to learn the binary classification task posed by [8]. In Appendix C we show that batch norm increases sensitivity to the learning rate in this case.
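This batch-wise coupling is easy to observe directly. The following minimal NumPy sketch (ours, for illustration; not code from the paper) applies the normalization of Eq. (1) with γ = 1 and β = 0 to two mini-batches that differ in a single example, and checks that the representation of every shared example changes:

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, c=1e-5):
    """Per-unit batch normalization of pre-activations h with shape
    (batch, units), as in Eq. (1) before the nonlinearity."""
    mu = h.mean(axis=0)                  # batch mean, per unit
    var = h.var(axis=0)                  # batch variance, per unit
    return gamma * (h - mu) / np.sqrt(var + c) + beta

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(32, 4))
batch_b = batch_a.copy()
batch_b[-1] = rng.normal(size=4)         # swap out a single example

out_a = batch_norm(batch_a)
out_b = batch_norm(batch_b)

# The batch statistics change, so every shared example's representation
# changes too, not just the swapped one:
diff = np.abs(out_a[:-1] - out_b[:-1]).max(axis=1)
print(bool((diff > 0).all()))            # True
```

Stacking several such layers amplifies this effect, which is the mechanism behind the representation drift visible in Figure 1.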

3 Empirical Results

We first evaluate the robustness (quantified as the drop in test accuracy under input perturbations) of convolutional networks, with and without batch norm, that were trained using standard procedures. The datasets – MNIST, SVHN, CIFAR-10, and ImageNet – were normalized to zero mean and unit variance. As a white-box adversarial attack we use projected gradient descent (PGD), in its ℓ∞- and ℓ2-norm variants, for its simplicity and ability to degrade performance with little perceptible change to the input [16]. We run PGD for 20 iterations, with the perturbation budget ε∞ and step size chosen separately for SVHN and CIFAR-10, and for ImageNet. For PGD-ℓ2 we scale the budget with √d, where d is the input dimension. We report the test accuracy for additive Gaussian noise of zero mean and fixed variance, denoted as "Noise" [5], as well as the CIFAR-10-C common corruption benchmark [11]. We found these methods were sufficient to demonstrate a considerable disparity in robustness due to batch norm, but this is not intended as a formal security evaluation. All uncertainties are the standard error of the mean.²
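For reference, PGD under an ℓ∞ budget iterates a signed-gradient step followed by projection back into the ε-ball around the clean input. A minimal sketch (ours; the toy linear "model" below is an assumption for illustration, not one of the networks evaluated here):

```python
import numpy as np

def pgd_linf(grad_fn, x, eps, step, n_iter=20):
    """PGD under an l-infinity budget: take signed-gradient ascent steps
    on the loss, projecting back into the eps-ball around the clean x."""
    x0 = x.copy()
    x_adv = x.copy()
    for _ in range(n_iter):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)   # l-inf projection
    return x_adv

# Toy stand-in for a model: a fixed linear score w @ x, whose gradient
# with respect to the input is simply w.
w = np.array([1.0, -2.0, 0.5])
grad_fn = lambda x: w
x_clean = np.zeros(3)
x_adv = pgd_linf(grad_fn, x_clean, eps=0.1, step=0.02)
print(np.abs(x_adv - x_clean).max())     # never exceeds eps = 0.1
```

In practice grad_fn would backpropagate the cross-entropy loss of a trained network to the input pixels; the projection step is what distinguishes PGD from unconstrained gradient ascent.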

²Each experiment has a unique uncertainty, hence the number of decimal places varies.

BN  Clean  Noise  PGD-ℓ∞  PGD-ℓ2
Table 1: Test accuracies of VGG8 on SVHN.

For the SVHN dataset, models were trained by stochastic gradient descent (SGD) with momentum 0.9 for 50 epochs, with a batch size of 128 and an initial learning rate that was dropped by a factor of ten at epochs 25 and 40. Trials were repeated over five random seeds. We show the results of this experiment in Table 1: despite batch norm increasing clean test accuracy, it reduced test accuracy for additive noise and for both the PGD-ℓ∞ and PGD-ℓ2 perturbations.

CIFAR-10 CIFAR-10.1
Model  BN  Clean  Noise  PGD-ℓ∞  PGD-ℓ2  Clean  Noise
VGG
VGG
WRN F
WRN
Table 2: Test accuracies of VGG8 and WideResNet–28–10 on CIFAR-10 and CIFAR-10.1 (v6) in several variants: clean, noisy, and PGD perturbed.

For the CIFAR-10 experiments we trained models with a similar procedure as for SVHN, but with random crops using four-pixel padding, and horizontal flips. We evaluate two families of contemporary models, one without skip connections (VGG), and WideResNets (WRN) using "Fixup" initialization [34] to reduce the use of batch norm. In the first experiment, a basic comparison with and without batch norm shown in Table 2, we evaluated the best model in terms of test accuracy after training for 150 epochs with a fixed learning rate. In this case, inclusion of batch norm for VGG reduces the clean generalization gap (the difference between training and test accuracy), but test accuracy drops for additive noise and for PGD perturbations in both the ℓ∞ and ℓ2 variants.

Model Test Accuracy (%)
L  BN  Clean  Noise  PGD-ℓ∞
8
8
11
11
13
13
16
19
Table 3: VGG models of increasing depth on CIFAR-10, with and without batch norm (BN). See text for differences in hyperparameters compared to Table 2.

Very similar results are obtained on a new test set, CIFAR-10.1 v6 [18]: batch norm slightly improves the clean test accuracy, but leads to a considerable drop in test accuracy with additive noise, and under both the ℓ∞ and ℓ2 PGD variants (PGD absolute values omitted for CIFAR-10.1 in Table 2 for brevity). It has been suggested that one of the benefits of batch norm is that it facilitates training with a larger learning rate [13, 2]. We test this from a robustness perspective in an experiment summarized in Table 3, where the initial learning rate was increased when batch norm was used. We prolonged training for up to 350 epochs, and dropped the learning rate by a factor of ten at epochs 150 and 250 in both cases, which increases clean test accuracy relative to Table 2. The deepest model that was trainable using standard "He" initialization [10] without batch norm was VGG13 (one of ten random seeds failed to achieve better than chance accuracy on the training set, while the others performed as expected; we report the first three successful runs for consistency with the other experiments). None of the deeper batch-normalized models recovered the robustness of the most shallow, or same-depth unnormalized, equivalents, nor does the higher learning rate with batch norm improve robustness compared to baselines trained for the same number of epochs. Additional results for deeper models on SVHN and CIFAR-10 can be found in Appendix A.3.

We also evaluated robustness on the common corruption benchmark comprising 19 types of real-world effects that can be grouped into four categories: "noise", "blur", "weather", and "digital" corruptions [11]. Each corruption has five "severity" or intensity levels. We report the mean error on the corrupted test set (mCE) by averaging over all intensity levels and corruptions [11].
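The mCE computation used here reduces to a plain average of per-corruption, per-severity error rates. A small sketch (ours; the error values below are hypothetical, purely for illustration):

```python
import numpy as np

def mean_corruption_error(err):
    """err[c][s]: test error for corruption c at severity level s.
    mCE here is the plain average over all corruptions and severity
    levels (no per-corruption baseline normalization)."""
    return np.asarray(err, dtype=float).mean()

# Hypothetical error rates for 3 corruptions x 5 severity levels:
errors = [[0.10, 0.12, 0.15, 0.20, 0.30],
          [0.08, 0.09, 0.11, 0.14, 0.18],
          [0.05, 0.06, 0.08, 0.10, 0.12]]
print(round(mean_corruption_error(errors), 4))   # -> 0.1253
```

Note that the ImageNet-C variant of this benchmark additionally normalizes each corruption's error by a baseline model's error; the plain average shown here matches the description above.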
We summarize the results for two VGG variants and a WideResNet on CIFAR-10-C, trained from scratch on the default training set for three and five random seeds respectively. Accuracies for the noise corruptions, which caused the largest difference in accuracy with batch norm, are outlined in Table 4. The key takeaway is: for all models tested, the batch-normalized variant had a higher error rate for all corruptions of the "noise" category, at every intensity level.

Model Test Accuracy (%)
Variant BN Clean Gaussian Impulse Shot Speckle
VGG8
VGG13
WRN28 F
Table 4: Robustness of three modern convolutional neural network architectures with and without batch norm on the CIFAR-10-C common "noise" corruptions [11]. We use "F" to denote the Fixup variant of WRN. Values were averaged over five intensity levels for each corruption.

Averaging over all 19 corruptions, we find that batch norm increased mCE for VGG8, VGG13, and WRN alike. There was a large disparity in accuracy when modulating batch norm for the different corruption categories, so we examine these in more detail.

Model Top-5 Test Accuracy (%)
Model  BN  Clean  Noise  PGD-ℓ∞
VGG-11
VGG-11
VGG-13
VGG-13
VGG-16
VGG-16
VGG-19
VGG-19
AlexNet
DenseNet121
ResNet18
Table 5: Models from torchvision.models pre-trained on ImageNet, some with and some without batch norm (BN).

For VGG8, batch norm widened the mean generalization gap for every noise variant: Gaussian, Impulse, Shot, and Speckle. After the "noise" category, the next most damaging corruptions (by difference in accuracy due to batch norm) were Contrast, Spatter, JPEG, and Pixelate. For the remaining corruptions, whether batch norm improved or degraded robustness was effectively a coin toss: the random error was comparable to the difference being measured. For VGG13, the batch norm accuracy gap was largest for Gaussian noise at severity levels 3, 4, and 5, and for Impulse noise at levels 4 and 5. Robustness to the other corruptions seemed to benefit from the slightly higher clean test accuracy of the batch-normalized VGG13; the remaining generalization gaps ranged from slightly negative for Zoom blur to positive for Pixelate. For the WRN, batch norm again widened the mean generalization gap for Gaussian, Impulse, Shot, and Speckle noise. Note that the large uncertainty in these measurements is due to high variance for the model with batch norm compared to Fixup. JPEG compression was the next most affected corruption. Interestingly, some corruptions that led to a positive gap for VGG8 showed a negative gap for the WRN, i.e., batch norm improved accuracy for Contrast, Snow, and Spatter. These were the same corruptions for which VGG13 lost, or did not improve, its robustness when batch norm was removed, hence why we believe these correlate with standard test accuracy (highest for WRN). Visually, these corruptions appear to preserve texture information. Conversely, noise is applied in a spatially global way that disproportionately degrades these textures, emphasizing shapes and edges. It is now known that modern CNNs trained on standard image datasets have a propensity to rely on texture, but we would rather they use shape and edge cues [7, 3].
Our results support the idea that batch norm may be exacerbating this tendency to leverage superficial textures for classification of image data. Next, we evaluated the robustness of pre-trained ImageNet models from the torchvision.models repository (https://pytorch.org/docs/stable/torchvision/models.html, v1.1.0), which conveniently provides models with and without batch norm. Results are shown in Table 5, where batch norm improves top-5 accuracy on noise in some cases, but consistently reduces it for PGD. The trends are the same for top-1 accuracy; only the absolute values of the degradation were smaller. Given the discrepancy between noise and PGD for ImageNet, we include a black-box transfer analysis in Appendix A.4 that is consistent with the white-box analysis.

Figure 2: We extend the experiment of [32] by training fully-connected nets of varying depth and constant width with ReLU layers by SGD on MNIST, over a range of batch sizes. The batch norm parameters γ and β were left at their defaults and momentum was disabled. The dashed line is the theoretical maximum trainable depth of batch-normalized networks as a function of the batch size. We report the clean test accuracy, and that for additive Gaussian noise and BIM perturbations. The batch-normalized models were trained for 10 epochs, while the unnormalized models were trained for 40 epochs as they took longer to converge. The 40-epoch batch-normalized plot was qualitatively similar, with dark blue bands for BIM for shallow and deep variants. The dark blue patch for the 55- and 60-layer unnormalized models at large batch sizes depicts a total failure to train. These networks were trainable by reducing the learning rate, but for consistency we keep it the same in both cases.

Finally, we explore the role of batch size and depth in Figure 2. Batch norm limits the maximum trainable depth, which increases with the batch size, but quickly plateaus as predicted by Theorem 3.10 of [32]. Robustness decreases with the batch size for depths that maintain a reasonable test accuracy, at around 25 or fewer layers. This tension between clean accuracy and robustness as a function of the batch size is not observed in unnormalized networks.

Figure 3: Estimated mutual information I(X;T) between quantized representations T and input X for the batch-normalized models from Figure 2. Values are lower-bounded by log₂(10) ≈ 3.3 bits by the number of classes, and upper-bounded by log₂(6×10⁴) ≈ 15.9 bits by the number of samples in the training set. Estimates are accurate to within the confidence interval of [17].

In unnormalized networks, we observe that perturbation robustness increases with the depth of the network. This is consistent with the computational benefit of the hidden layers proposed by [23], who take an information-theoretic approach. This analysis uses two mutual information terms: I(X;T) – the information in the layer activations T about the input X, which is a measure of representational complexity, and I(T;Y) – the information in the activations about the label Y, which is understood as the predictive content of the learned input representations T. It is shown that under SGD training, I(T;Y) generally increases with the number of epochs, while I(X;T) increases initially, but decreases throughout the later stage of the training procedure. An information-theoretic proof as to why reducing I(X;T), while ensuring a sufficiently high value of I(T;Y), should promote good generalization from finite samples is given in [29, 22]. We estimate I(X;T) for the batch-normalized networks from the experiment in Figure 2 for sub-sampled batch sizes and plot it in Figure 3. We assume I(X;T) = H(T), since the networks are noiseless and thus T is deterministic given X. We use the "plug-in" maximum-likelihood estimate of the entropy over the full MNIST training set [17]. Activations T are taken as the softmax output, which was quantized to 7-bit accuracy. The number of bits was determined by reducing the precision as low as possible without inducing classification errors; this provides a notion of the model's "intrinsic" precision. We use the confidence interval recommended by [17], which contains both bias and variance terms for the regime where the number of samples is large relative to the support of the distribution; this is multiplied by ten for each dimension. Our first observation is that the configurations where I(X;T) is low – indicating a more compressed representation – are the same settings where the model obtains high clean test accuracy. The transition of I(X;T) at around 10 bits occurs remarkably close to the theoretical maximum trainable depth. For the unnormalized network, the absolute values of I(X;T) were almost always comparable to or less than the lowest value obtained by any batch-normalized network. We therefore omit the comparable figure for brevity, but note that I(X;T) did continue to decrease with depth in many cases, e.g., for a mini-batch size of 20, although these differences were small compared to the worst-case error. The fact that I(X;T) is small where BIM robustness is poor for batch-normalized networks disagrees with our initial hypothesis that more layers were needed to decrease I(X;T). However, this result is consistent with the observation that it is possible for networks to overfit via too much compression [23]. In particular, [32] prove that batch norm loses the information between mini-batches exponentially quickly in the depth of the network, so over-fitting via "too much" compression is consistent with our results. This intuition requires further analysis, which is left for future work.
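The estimation procedure above can be sketched as follows (our reconstruction, for illustration; random softmax outputs stand in for a trained network's predictions, and the 7-bit quantization follows the description in the text):

```python
import numpy as np

def quantize(probs, bits=7):
    """Quantize softmax outputs to `bits` bits of precision."""
    levels = 2 ** bits
    return np.round(probs * (levels - 1)).astype(np.int64)

def plugin_entropy_bits(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in bits, of the
    empirical distribution over distinct rows of `samples`."""
    _, counts = np.unique(samples, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Random softmax outputs as a stand-in for a trained network's T:
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

t = quantize(probs)                  # discrete representation T
h_t = plugin_entropy_bits(t)         # H(T) = I(X;T) for deterministic T
print(h_t)                           # at most log2(1000) ~ 9.97 bits
```

Because the network is deterministic, every distinct input that maps to a distinct quantized output contributes to H(T), which is why the estimate is upper-bounded by the log of the number of samples.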

4 Vulnerability and Input Dimension

A recent work [24] analyzes the adversarial vulnerability of batch-normalized networks at initialization time and conjectures, based on a scaling analysis, that under the commonly used initialization scheme of [10], adversarial vulnerability scales as √d, where d is the input dimension.

Model Test Accuracy (%)
BN Clean Noise
28
56
84
Table 6: Evaluating the robustness of an MLP with and without batch norm. See text for architecture. We observe a reduction in test accuracy due to batch norm that grows as the input width increases.

They also show in experiments that independence between vulnerability and the input dimension can be approximately recovered through adversarial training by projected gradient descent (PGD) [16], with a modest trade-off in clean accuracy. We show that this can be achieved by simpler means and with little to no trade-off through weight decay, where the regularization constant corrects the loss scaling as the norm of the input increases with √d.

Model Test Accuracy (%)
BN Clean Noise
56
84
Table 7: Evaluating the robustness of an MLP with weight decay (same regularization constant as for the linear model; see Table 5 of Appendix B). See text for architecture. Adding batch norm degrades all accuracies.

We increase the MNIST image width from 28 to 56, 84, and 112 pixels. The loss is predicted to grow like √d for ε∞-sized attacks by Thm. 4 of [24]. We confirm that without regularization the loss does scale roughly as predicted: the predicted values lie between the loss ratios obtained for the two attack variants for most image widths (see Table 4 of Appendix B). Training with weight decay, however, we obtain adversarial and clean test accuracy ratios close to unity for widths of 56, 84, and 112, relative to the original dataset. A more detailed explanation and results are provided in Appendix B. Next, we repeat this experiment with a two-hidden-layer ReLU MLP, with the number of hidden units equal to half the input dimension, and optionally use one hidden layer with batch norm (this choice of architecture is mostly arbitrary; the trends were the same for constant-width layers). To evaluate robustness, 100 iterations of BIM-ℓ∞ were used with a step size of 1e-3. We also report test accuracy with additive Gaussian noise of zero mean and unit variance, the same first two moments as the clean images (we first apply the noise to the original 28×28 pixel images, then resize them to preserve the appearance of the noise). Despite only a small difference in clean accuracy, Table 6 shows that for the original image resolution, batch norm reduced accuracy for both noise and BIM-ℓ∞. Robustness keeps decreasing as the image size increases, with the batch-normalized network having less robustness to both BIM and noise at all sizes. We then apply the regularization constants tuned for the respective input dimensions on the linear model to the ReLU MLP with no further adjustments. Table 7 shows that adding sufficient regularization to recover the original (no BN) accuracy for BIM when using batch norm induces a substantial test error increase on MNIST.
Furthermore, using the same regularization constant without batch norm increases clean test accuracy as well as accuracy under the BIM-ℓ∞ perturbation. Following the guidance in the original work on batch norm [13] to the extreme of reducing weight decay to zero when using batch norm degrades accuracy for the perturbation even further. In all cases, using batch norm greatly reduced test accuracy for noisy and adversarially perturbed inputs, while weight decay increased accuracy for such inputs.
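The √d scaling that weight decay must counteract can be seen directly: nearest-neighbour upsampling of an image by an integer factor k multiplies the input dimension by k² and the ℓ2 norm of the input by k = √(d/d₀). A quick check (ours, for illustration):

```python
import numpy as np

def upsample(img, k):
    """Nearest-neighbour upsampling by an integer factor k, as when
    resizing 28x28 MNIST images to 56x56, 84x84, or 112x112."""
    return np.repeat(np.repeat(img, k, axis=0), k, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))

for k in (2, 3, 4):
    x_k = upsample(x, k)
    d_ratio = x_k.size / x.size                     # d grows as k**2
    norm_ratio = np.linalg.norm(x_k) / np.linalg.norm(x)
    # The l2 norm grows as sqrt(d_ratio) = k, which is the loss scaling
    # the regularization constant must absorb.
    print(k, round(norm_ratio, 6), round(np.sqrt(d_ratio), 6))
```

Since the loss gradient with respect to the weights scales with the input norm, tuning the weight decay constant per input dimension compensates for exactly this factor.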

5 Related Work

Our work examines the effect of batch norm on model robustness at test time. Many references with an immediate connection to our work were discussed in the previous sections; here we briefly mention other works that do not have a direct relationship to our experiments, but are relevant to the topic of batch norm in general. The original work [13] that introduced batch norm as a technique for improving neural network training and test performance motivated it by "internal covariate shift" – a term referring to the changing distribution of layer outputs, an effect that requires subsequent layers to steadily adapt to the new distribution and thus slows down the training process. Several follow-up works started from the empirical observation that batch norm usually accelerates and stabilizes training, and attempted to clarify the mechanism behind this effect. One argument is that batch-normalized networks have a smoother optimization landscape due to smaller gradients immediately before the batch-normalized layer [19]. However, [32] study the effect of stacking many batch-normalized layers and prove that this causes gradient explosion that is exponential in the depth of the network for any non-linearity. In practice, relatively shallow batch-normalized networks exhibit the expected "helpful smoothing" of the loss surface [19], while very deep ones are not trainable [32]. In our work, we find that a single batch-normalized layer suffices to induce severe adversarial vulnerability.


Figure 4: Visualization of activations in a two-unit layer over 500 epochs. The model is a fully-connected MLP (left: 784–392–196–2–49–10; right: 784–392–BN–196–BN–2–49–10) with ReLU units, mini-batch size 128, learning rate 1e-2, and weight decay λ = 1e-3. The plots have a fixed x- and y-axis range. Samples from the MNIST training set are plotted and coloured by label.

In Figure 4 we visualize the activations of the penultimate hidden layer in a fully-connected network without (left) and with (right) batch norm over the course of 500 epochs. In the unnormalized network, all data points overlap at initialization. Over the first epochs, the points spread further apart (middle plot) and begin to form clusters. In the final stage, the clusters become tighter. When we introduce two batch norm layers into the network, placing them before the visualized layer, the activation patterns display notable differences: i) at initialization, all data points are spread out, allowing easier partitioning into clusters and thus facilitating faster training, which we believe is associated with the "helpful smoothing" property identified by [19] for shallow networks; ii) the clusters are more stationary, and the stages of cluster formation and tightening are less distinct; iii) the inter-cluster distances and the clusters themselves are larger. Weight decay's loss scaling mechanism is complementary to other mechanisms identified in the literature, for instance that it increases the effective learning rate [31, 33]. Our results are consistent with these works in that weight decay reduces the generalization gap (between training and test error), even in batch-normalized networks where it is presumed to have no effect. Given that batch norm is not typically used on all layers, the loss scaling mechanism persists, although to a lesser degree, in this case.
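The effective-learning-rate mechanism of [31, 33] rests on the fact that a batch-normalized layer is invariant to the scale of its incoming weights, so shrinking ‖w‖ (as weight decay does) amplifies the relative effect of each gradient step. A minimal sketch of the invariance (ours, not an experiment from this paper):

```python
import numpy as np

def bn_layer(x, w, c=1e-5):
    """A linear layer followed by batch norm (gamma = 1, beta = 0)."""
    h = x @ w
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + c)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w = rng.normal(size=(8, 4))

out = bn_layer(x, w)
out_scaled = bn_layer(x, 10.0 * w)   # rescale the weights by alpha = 10

# Batch norm makes the layer output (nearly) invariant to the scale of w,
# so gradients w.r.t. w shrink as 1/alpha; by driving ||w|| down, weight
# decay therefore raises the effective learning rate.
print(bool(np.allclose(out, out_scaled, atol=1e-3)))   # True
```

The invariance is only approximate because of the small constant c in the denominator; for any weight scale large relative to √c it holds to high precision.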

6 Conclusion

We found that there is no free lunch with batch norm: the accelerated training properties and occasionally higher clean test accuracy come at the cost of robustness, both to additive noise and to adversarial perturbations. We have shown that there is no inherent relationship between the input dimension and vulnerability. Our results highlight the importance of identifying the disparate mechanisms of regularization techniques, especially when concerned about robustness.

Acknowledgements

The authors wish to acknowledge the financial support of NSERC, CFI, CIFAR and EPSRC. We also acknowledge hardware support from NVIDIA and Compute Canada. Research at the Perimeter Institute is supported by Industry Canada and the province of Ontario through the Ministry of Research & Innovation. We thank Thorsteinn Jonsson for helpful discussions; Colin Brennan, Terrance DeVries and Jörn-Henrik Jacobsen for technical suggestions; Justin Gilmer for suggesting the common corruption benchmark; Maeve Kennedy, Vithursan Thangarasa, Katya Kudashkina, and Boris Knyazev for comments and proofreading.

References

  • [1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In International Conference on Machine Learning, pages 274–283, 2018.
  • [2] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger. Understanding Batch Normalization. In Advances in Neural Information Processing Systems 31, pages 7705–7716. Curran Associates, Inc., 2018.
  • [3] W. Brendel and M. Bethge. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
  • [4] G. W. Ding, L. Wang, and X. Jin. AdverTorch v0.1: An Adversarial Robustness Toolbox based on PyTorch. arXiv preprint arXiv:1902.07623, 2019.
  • [5] N. Ford, J. Gilmer, and E. D. Cubuk. Adversarial Examples Are a Natural Consequence of Test Error in Noise. 2019.
  • [6] A. Galloway, T. Tanay, and G. W. Taylor. Adversarial Training Versus Weight Decay. arXiv preprint arXiv:1804.03308, 2018.
  • [7] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
  • [8] J. Gilmer, L. Metz, F. Faghri, S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial Spheres. In International Conference on Learning Representations Workshop Track, 2018.
  • [9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In International Conference on Computer Vision, pages 1026–1034. IEEE Computer Society, 2015.
  • [11] D. Hendrycks and T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations, 2019.
  • [12] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.
  • [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
  • [14] J.-H. Jacobsen, J. Behrmann, N. Carlini, F. Tramèr, and N. Papernot. Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness. Safe Machine Learning workshop at ICLR, 2019.
  • [15] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial Machine Learning at Scale. International Conference on Learning Representations, 2017.
  • [16] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, 2018.
  • [17] L. Paninski. Estimation of Entropy and Mutual Information. In Neural Computation, volume 15, pages 1191–1253. 2003.
  • [18] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv:1806.00451, 2018.
  • [19] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How Does Batch Normalization Help Optimization? In Advances in Neural Information Processing Systems 31, pages 2488–2498. 2018.
  • [20] L. Schott, J. Rauber, M. Bethge, and W. Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019.
  • [21] D. Sculley, J. Snoek, A. Wiltschko, and A. Rahimi. Winner’s Curse? On Pace, Progress, and Empirical Rigor. In International Conference on Learning Representations, Workshop, 2018.
  • [22] R. Shwartz-Ziv, A. Painsky, and N. Tishby. Representation Compression and Generalization in Deep Neural Networks. 2019.
  • [23] R. Shwartz-Ziv and N. Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810 [cs], 2017.
  • [24] C.-J. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Schölkopf, and D. Lopez-Paz. Adversarial Vulnerability of Neural Networks Increases With Input Dimension. arXiv:1802.01421 [cs, stat], 2018.
  • [25] D. Soudry, E. Hoffer, M. S. Nacson, and N. Srebro. The Implicit Bias of Gradient Descent on Separable Data. In International Conference on Learning Representations, 2018.
  • [26] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is Robustness the Cost of Accuracy? – A Comprehensive Study on the Robustness of 18 Deep Image Classification Models. In Computer Vision – ECCV 2018, pages 644–661. Springer International Publishing, 2018.
  • [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • [28] T. Tanay and L. D. Griffin. A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples. arXiv:1608.07690, 2016.
  • [29] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. In Information Theory Workshop, pages 1–5. IEEE, 2015.
  • [30] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations, 2019.
  • [31] T. van Laarhoven. L2 Regularization versus Batch and Weight Normalization. arXiv:1706.05350, 2017.
  • [32] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz. A Mean Field Theory of Batch Normalization. In International Conference on Learning Representations, 2019.
  • [33] G. Zhang, C. Wang, B. Xu, and R. Grosse. Three Mechanisms of Weight Decay Regularization. In International Conference on Learning Representations, 2019.
  • [34] H. Zhang, Y. N. Dauphin, and T. Ma. Residual Learning Without Normalization via Better Initialization. In International Conference on Learning Representations, 2019.

Appendix A Supplement to Empirical Results

This section contains explanations and results supplementary to those of Section 3.

A.1 Why the VGG Architecture?

For the SVHN and CIFAR-10 experiments, we selected the VGG family of models as a simple yet contemporary convolutional architecture whose development occurred independently of batch norm. This makes it suitable for a causal intervention, given that we want to study the effect of batch norm itself, and not batch norm plus other architectural innovations plus hyperparameter tuning. State-of-the-art architectures, such as Inception and ResNet, whose development is more intimately linked with batch norm, may be less suitable for this kind of analysis. The superior standard test accuracy of these models is somewhat moot given the trade-off between standard test accuracy and robustness demonstrated in this work and elsewhere [28, 6, 26, 30]. Aside from these reasons, and the provision of pre-trained variants on ImageNet with and without batch norm in torchvision.models for ease of reproducibility, this choice of architecture is arbitrary.

A.2 Comparison of PGD to BIM

We used the PGD implementation from [4] with settings as below. The pixel range was set to for SVHN, and for CIFAR-10 and ImageNet:

import torch.nn as nn
from advertorch.attacks import LinfPGDAttack

adversary = LinfPGDAttack(
    net, loss_fn=nn.CrossEntropyLoss(reduction="sum"),
    eps=0.03, nb_iter=20, eps_iter=0.003,
    rand_init=False, clip_min=-1.0, clip_max=1.0, targeted=False)

We compared PGD using a step size of to our own BIM implementation with a step size of , for the same number (20) of iterations. This reduces test accuracy for perturbations from for BIM to for PGD for the unnormalized VGG8 network, and from to for the batch-normalized network. The difference due to batch norm is identical in both cases. Results were also consistent between PGD and BIM for ImageNet. We also tried increasing the number of PGD iterations for deeper networks. For VGG16 on CIFAR-10, using 40 iterations of PGD with a smaller step size, instead of 20 iterations, reduced accuracy only marginally.
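For reference, the iterative attack itself is only a few lines. The following is a hypothetical NumPy sketch of BIM (ℓ∞) against a simple logistic model; it is illustrative only, and is not the advertorch implementation or the models used in our experiments.

```python
import numpy as np

def bim_linf(x, y, w, b, eps, eps_iter, n_iter):
    """Basic Iterative Method (l-inf): repeatedly step in the sign of the
    input gradient of the logistic loss, projecting back to the eps-ball."""
    x_adv = x.copy()
    for _ in range(n_iter):
        z = y * (x_adv @ w + b)                   # margin of the current iterate
        grad = -y * w / (1.0 + np.exp(z))         # d/dx of log(1 + exp(-z))
        x_adv = x_adv + eps_iter * np.sign(grad)  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to the eps-ball
        x_adv = np.clip(x_adv, -1.0, 1.0)         # stay in the valid input range
    return x_adv

# A correctly classified point should cross the boundary after the attack.
w, b = np.array([1.0, -0.5]), 0.0
x, y = np.array([0.3, -0.2]), 1.0
x_adv = bim_linf(x, y, w, b, eps=0.3, eps_iter=0.05, n_iter=20)
```

PGD differs from this sketch only in its (optional) random initialization within the ε-ball before the first step.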

A.3 Additional SVHN and CIFAR-10 Results for Deeper Models

Our first attempt to train VGG models with more than 8 layers on SVHN failed; therefore, for a fair comparison, we report in Table 8 the robustness of the deeper models that were only trainable by using batch norm. None of these models obtained much better robustness in terms of PGD-, although they did do better for PGD-.

Test Accuracy ()
L Clean Noise PGD- PGD-
11
13
16
19
Table 8: VGG variants on SVHN with batch norm.

Fixup initialization was recently proposed to reduce the use of normalization layers in deep residual networks [34]. As a natural test, we compare a WideResNet (28 layers, width factor 10) with Fixup versus the default architecture with batch norm. Note that the Fixup variant still contains one batch norm layer before the classification layer, but the number of batch norm layers is still greatly reduced. (We used the implementation from https://github.com/valilenk/fixup, but stopped training at 150 epochs for consistency with the VGG8 experiment. Both models had already fit the training set by this point.)

CIFAR-10 CIFAR-10.1
Model Clean Noise PGD- PGD- Clean Noise
Fixup
BN
Table 9: Accuracies of WideResNet–28–10 on CIFAR-10 and CIFAR-10.1 (v6).

We train WideResNets (WRN) with five unique seeds and show their test accuracies in Table 9. Consistent with [18], the higher clean test accuracy on CIFAR-10 obtained by the WRN compared to VGG translated to higher clean accuracy on CIFAR-10.1. However, these gains were wiped out by moderate Gaussian noise, where VGG8 dramatically outperforms both WideResNet variants. Unlike VGG8, the WRN showed little generalization gap between the noisy CIFAR-10 and CIFAR-10.1 variants. The Fixup variant improves accuracy for noisy CIFAR-10, noisy CIFAR-10.1, PGD-, and PGD-. We believe our work serves as a compelling motivation for Fixup and other techniques that aim to reduce usage of batch normalization. The role of skip-connections should be isolated in future work, since absolute accuracies were consistently lower for residual networks.

A.4 ImageNet Black-box Transferability Analysis

Target
11 13 16 19
Acc. Type Source
Top 1 11 1.2 42.4 37.8 42.9 43.8 49.6 47.9 53.8
58.8 0.3 58.2 45.0 61.6 54.1 64.4 58.7
Top 5 11 11.9 80.4 75.9 80.9 80.3 83.3 81.6 85.1
87.9 6.8 86.7 83.7 89.0 85.7 90.4 88.1
Table 10: ImageNet validation accuracy for adversarial examples transferred between VGG variants of various depths, indicated by number, with and without batch norm (“✓”, “✗”). All adversarial examples were crafted with BIM- using 10 steps and a step size of 5e-3, which is higher than for the white-box analysis to improve transferability. The BIM objective was simply misclassification, i.e., it was not a targeted attack. For efficiency reasons, we selected 2048 samples from the validation set. Values along the diagonal in the first two columns, where Source = Target, indicate white-box accuracy.

The discrepancy between the results in additive noise and for white-box BIM perturbations for ImageNet in Section 3 raises a natural question: is gradient masking a factor influencing the success of the white-box results on ImageNet? No: consistent with the white-box results, when the target is unnormalized but the source is batch-normalized, both top 1 and top 5 accuracy are higher than vice versa. This can be observed in Table 10 by comparing the diagonals from lower left to upper right. When targeting an unnormalized model, we reduce top 1 accuracy by more when using a source that is also unnormalized, compared to the difference obtained by matching batch-normalized networks. This suggests that the features used by unnormalized networks are more stable than those of batch-normalized networks. Unfortunately, the pre-trained ImageNet models provided by the PyTorch developers do not include hyperparameter settings or other training details. However, we believe that this speaks to the generality of the results, i.e., that they are not sensitive to hyperparameters.
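The transfer protocol can be illustrated on a toy problem. The following hypothetical NumPy sketch (stand-in linear "source" and "target" classifiers, not our VGG models) crafts FGSM examples against the source and checks that they also degrade the independently trained target:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linearly separable toy task shared by two independently trained models.
n, d = 200, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)

def train(X, y, steps=500, lr=0.1):
    """Logistic regression by plain gradient descent (no bias, for brevity)."""
    w = np.zeros(d)
    for _ in range(steps):
        z = y * (X @ w)
        w += lr * (X * (y / (1.0 + np.exp(z)))[:, None]).mean(0)
    return w

w_src = train(X[:100], y[:100])   # black-box "source" model
w_tgt = train(X[100:], y[100:])   # "target" model under attack

# FGSM crafted on the source, evaluated on the target.
eps = 0.5
X_adv = X - eps * np.sign(w_src) * y[:, None]   # sign of d(loss)/dx for the source
acc_clean = (np.sign(X @ w_tgt) == y).mean()
acc_adv = (np.sign(X_adv @ w_tgt) == y).mean()
```

Because both models approximate the same decision boundary, perturbations that cross the source's boundary tend to cross the target's as well, which is the essence of black-box transferability.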

A.5 Batch Norm Limits Maximum Trainable Depth and Robustness

Figure 5: We repeat the experiment of [32] by training fully-connected models of varying depth and constant width (384) with ReLU units by SGD on MNIST. We train for 10 and 40 epochs in (a) and (b), respectively. The batch norm parameters γ and β were left as default, momentum was disabled, and ε = 1e-3. Each coordinate is first averaged over three seeds. Diamond-shaped artefacts for the unnormalized case indicate that one of three seeds failed to train; note that we show an equivalent version of (a) with these outliers removed and additional batch sizes from 5–20 in Figure 2. Best viewed in colour.

In Figure 5 we show that batch norm not only limits the maximum trainable depth, but also that robustness decreases with the batch size for depths that maintain test accuracy, at around 25 or fewer layers (Figure 5). Both clean accuracy and robustness showed little to no relationship with depth or batch size in unnormalized networks. A few outliers are observed for unnormalized networks at large depths and batch sizes, which could be due to the reduced number of parameter update steps that result from a higher batch size and a fixed number of epochs [12]. Note that in Figure 5 the bottom row, without batch norm, appears lighter than the equivalent plot above it with batch norm, indicating that unnormalized networks obtain lower absolute peak accuracy than batch-normalized networks. Given that the unnormalized networks take longer to converge, we prolong training to 40 epochs total. When they do converge, we see more configurations that achieve higher clean test accuracy than batch-normalized networks in Figure 5. Furthermore, good robustness can be achieved simultaneously with good clean test accuracy in unnormalized networks, whereas the regimes of good clean accuracy and robustness remain mostly non-overlapping in Figure 5.

Appendix B Weight Decay and Input Dimension

Consider a logistic classification model represented by a neural network consisting of a single unit, parameterized by weights w ∈ ℝ^d and bias b ∈ ℝ, with input denoted by x ∈ ℝ^d and true labels y ∈ {±1}. Predictions are defined by ŷ = σ(w⊤x + b), and the model is optimized through empirical risk minimization, i.e., by applying stochastic gradient descent (SGD) to the loss function (2), where ζ(z) = log(1 + e^(−z)):

L(w, b) = (1/n) Σᵢ ζ(yᵢ (w⊤xᵢ + b))    (2)

We note that w⊤x + b is a scaled, signed distance between x and the classification boundary defined by our model. If we define d(x) as the signed Euclidean distance between x and the boundary, then we have: w⊤x + b = ‖w‖₂ d(x). Hence, minimizing (2) is equivalent to minimizing

(1/n) Σᵢ ζ(‖w‖₂ yᵢ d(xᵢ))    (3)

We define the scaled loss as

ζ_‖w‖₂(z) := ζ(‖w‖₂ z)    (4)

and note that adding a regularization term λ‖w‖₂² in (3), resulting in (5), can be understood as a way of controlling the scaling of the loss function:

(1/n) Σᵢ ζ_‖w‖₂(yᵢ d(xᵢ)) + λ‖w‖₂²    (5)
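The relation between the pre-activation and the signed Euclidean distance noted above can be sanity-checked numerically. The following short NumPy snippet (illustrative only, with randomly drawn w, b, x) constructs the nearest point on the hyperplane and verifies that w⊤x + b equals ‖w‖₂ times the signed distance:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
b = rng.normal()
x = rng.normal(size=5)

# Nearest point to x on the hyperplane {v : w.v + b = 0}.
x_proj = x - ((w @ x + b) / (w @ w)) * w
assert np.isclose(w @ x_proj + b, 0.0)          # it lies on the boundary

# Signed Euclidean distance from x to the boundary.
d = np.sign(w @ x + b) * np.linalg.norm(x - x_proj)

# The pre-activation is the distance scaled by the weight norm.
assert np.isclose(w @ x + b, np.linalg.norm(w) * d)
```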
Figure 6: (a) For a given weight vector w and bias b, the values of w⊤x + b over the training set typically follow a bimodal distribution (corresponding to the two classes) centered on the classification boundary. (b) Multiplying by the label y allows us to distinguish the correctly classified data in the positive region from misclassified data in the negative region. (c) We can then attribute a penalty to each training point by applying the loss ζ to y(w⊤x + b). (d) For a small regularization parameter λ (large ‖w‖₂), the misclassified data is penalized linearly while the correctly classified data is not penalized. (e) A medium regularization parameter (medium ‖w‖₂) corresponds to smoothly blending the margin. (f) For a large regularization parameter (small ‖w‖₂), all data points are penalized almost linearly.

In Figures 6(a)-6(c), we develop intuition for the different quantities contained in (2) with respect to a typical binary classification problem, while Figures 6(d)-6(f) depict the effect of the regularization parameter λ on the scaling of the loss function. To test this theory empirically, we study a model with a single linear layer (number of units equal to the input dimension) and a cross-entropy loss function on variants of MNIST of increasing input dimension, to approximate the toy model described in the “core idea” from [24] as closely as possible, but with a model capable of learning. Clearly, this model is too simple to obtain competitive test accuracy, but it is a helpful first step that will be subsequently extended to ReLU networks. The model was trained by SGD for 50 epochs with a constant learning rate of 1e-2 and a mini-batch size of 128. In Table 11 we show that increasing the input dimension by resizing MNIST from 28×28 to various resolutions with PIL.Image.NEAREST interpolation increases adversarial vulnerability in terms of both accuracy and loss. Furthermore, the “adversarial damage”, defined as the average increase of the loss after attack, which Theorem 4 of [24] predicts will grow with the input dimension, falls in between that obtained empirically for the two perturbation magnitudes for all image widths except 112, which experiences slightly more damage than anticipated. [24] note that independence between vulnerability and the input dimension can be recovered through adversarial-example augmented training by projected gradient descent (PGD), with a small trade-off in terms of standard test accuracy. We find that the same can be achieved through a much simpler approach: weight decay, with the parameter λ chosen dependent on the input dimension to correct for the loss scaling. This way we recover input-dimension-invariant vulnerability with little degradation of test accuracy, as seen in Table 11, where the accuracy ratio between clean and perturbed inputs is much closer to unity with weight decay regularization than without.
Compared to PGD training, weight decay regularization i) does not have an arbitrary ε hyperparameter that ignores inter-sample distances, ii) does not prolong training by a multiplicative factor given by the number of steps in the inner loop, and iii) is less attack-specific. Thus, we do not use adversarially augmented training because we wish to convey a notion of robustness to unseen attacks and common corruptions. Furthermore, enforcing robustness to norm-bounded perturbations may increase vulnerability to invariance-based examples, where semantic changes are made to the input, thus changing the oracle label, but not the classifier’s prediction [14]. Our models trained with weight decay obtained higher accuracy (86% vs. 74% correct) compared to batch norm on a small sample of 100 invariance-based MNIST examples. (Invariance-based adversarial examples were downloaded from https://github.com/ftramer/Excessive-Invariance.) We make primary use of traditional perturbations as they are well studied in the literature and straightforward to compute, but solely defending against these is not the end goal. A more detailed comparison between adversarial training and weight decay can be found in [6]. The loss-scaling mechanism of weight decay is complementary to other mechanisms identified in the literature recently, for instance that it also increases the effective learning rate [31, 33]. Our results are consistent with these works in that weight decay reduces the generalization gap, even in batch-normalized networks where it is presumed to have no effect. Given that batch norm is not typically used on the last layer, the loss-scaling mechanism persists in this setting, albeit to a lesser degree.
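The loss-scaling mechanism can also be illustrated numerically. The following NumPy sketch (hypothetical values, chosen only for illustration) evaluates the scaled loss ζ(‖w‖z)/‖w‖ at a correctly classified point (z = +1) and a misclassified point (z = −1), showing that a large weight norm yields a hinge-like loss while a small weight norm, as encouraged by weight decay, penalizes all points almost linearly:

```python
import numpy as np

def zeta(z):
    """Numerically stable logistic loss log(1 + exp(-z))."""
    return np.logaddexp(0.0, -z)

large, small = 10.0, 0.1   # stand-ins for a large and a small ||w||

# Scaled loss per unit weight norm at z = +1 (correct) and z = -1 (wrong).
correct_large = zeta(large * 1.0) / large    # ~0: correct points unpenalized
wrong_large = zeta(large * -1.0) / large     # ~1: mistakes penalized linearly

# Magnitude of the slope of the scaled loss at a point with margin z.
slope = lambda z, t: 1.0 / (1.0 + np.exp(t * z))
```

At the small weight norm, `slope(-1.0, small)` and `slope(1.0, small)` are nearly equal, i.e., the loss is almost linear on both sides of the boundary, matching Figure 6(f); at the large norm the slopes differ sharply, matching Figure 6(d).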

Model (Relative) Test Accuracy (Relative) Loss
Clean Clean Pred.
28 -
56 2
56 0.01 -
84 3
84 0.0225 -
112 4
112 0.05 -
Table 11: Mitigating the effect of the input dimension on adversarial vulnerability by correcting the margin enforced by the loss function. The regularization constant λ refers to weight decay. Consistent with [24], we use FGSM perturbations, the optimal attack for a linear model. Values in rows with λ are ratios of the entry (accuracy or loss) with respect to the baseline. “Pred.” is the predicted increase of the loss due to a small perturbation using Thm. 4 of [24].
Model (Relative) Test Accuracy (Relative) Loss
BN Clean Clean
28
28
56
56
84
84
Table 12: Two-hidden-layer ReLU MLP (see main text for architecture), with and without batch norm (BN), trained for 50 epochs and repeated over five random seeds. Values in rows with λ are ratios with respect to the baseline (accuracy or loss). There is a considerable increase of the loss, or equivalently, a degradation of robustness in terms of accuracy, due to batch norm. The discrepancy for BIM- with batch norm represents a degradation in absolute accuracy compared to the baseline.

Appendix C Adversarial Spheres

The “Adversarial Spheres” dataset contains points sampled uniformly from the surfaces of two concentric n-dimensional spheres of different radii, and the classification task is to attribute a given point to the inner or outer sphere. We consider the two-dimensional case, that is, datapoints from two concentric circles. This simple problem poses a challenge to the conventional wisdom regarding batch norm: not only does batch norm harm robustness, it also makes training less stable. In Figure 8 we show that, using the same architecture as in [8], the batch-normalized network is highly sensitive to the learning rate. We use SGD instead of Adam to avoid introducing unnecessary complexity, and especially since SGD has been shown to converge to the maximum-margin solution for linearly separable data [25]. We use a finite dataset of 500 samples projected onto the circles. The unnormalized network achieves zero training error for learning rates up to 0.1 (not shown), whereas the batch-normalized network is untrainable already at a considerably smaller learning rate. To evaluate robustness, we sample 10,000 test points from the same distribution for each class (20k total), and apply additive Gaussian noise. We evaluate only the models that could be trained to zero training error with the smaller learning rate. The model with batch norm classifies fewer of these noisy points correctly than the unnormalized net.
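The dataset and the noise evaluation are straightforward to reproduce. The following NumPy sketch is illustrative: the radii (1.0 and 1.3), the noise scale (0.05), and the radius-threshold classifier are assumptions for the example, not our trained network or the exact experimental values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circles(n, r_inner=1.0, r_outer=1.3):
    """Sample n points per class uniformly from two concentric circles.
    The radii here are illustrative assumptions."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
    r = np.concatenate([np.full(n, r_inner), np.full(n, r_outer)])
    x = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    y = np.concatenate([np.zeros(n), np.ones(n)])   # 0: inner, 1: outer
    return x, y

# Noise robustness of the ideal rule (threshold the radius at the midpoint).
x, y = sample_circles(10_000)
x_noisy = x + rng.normal(scale=0.05, size=x.shape)
pred = (np.linalg.norm(x_noisy, axis=1) > 1.15).astype(float)
acc = (pred == y).mean()
```

A classifier that has actually learned the radial structure stays accurate under small additive noise; the interesting empirical finding is how far the batch-normalized network falls short of this.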

Figure 7: Two mini-batches from the “Adversarial Spheres” dataset (2D variant), and their representations in a deep linear network at initialization time (a) with batch norm and (b) without batch norm. Mini-batch membership is indicated by marker fill and class membership by colour. Each layer is projected to its two principal components. In (b) we scale both components by a factor of 100, as the dynamic range decreases with depth under default initialization. We observe in (a) that some samples are already overlapping at Layer 2, and classes are mixed at Layer 14.
Figure 8: We train the same two-hidden-layer fully-connected network of width 1000 with ReLU activations and a mini-batch size of 50 on a 2D variant of the “Adversarial Spheres” binary classification problem [8]. Dashed lines denote the model with batch norm. The batch-normalized model fails to train for a learning rate that otherwise leads to quick convergence for the unnormalized equivalent. We repeat the experiment over five random seeds; shaded regions indicate a confidence interval.
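The mini-batch dependence visible in Figure 7 can be demonstrated in a few lines. This hypothetical NumPy sketch (random 14-layer linear net, no learned scale/shift, illustrative widths and batch sizes) shows that with batch norm the representation of a fixed input depends on which mini-batch it is processed in, whereas without batch norm it does not:

```python
import numpy as np

rng = np.random.default_rng(1)
depth, width = 14, 2
Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
      for _ in range(depth)]

def forward(batch, use_bn):
    """Forward a mini-batch through the deep linear net at initialization."""
    h = batch
    for W in Ws:
        h = h @ W
        if use_bn:
            # Normalize each feature over the mini-batch (no learned gamma/beta).
            h = (h - h.mean(0)) / (h.std(0) + 1e-5)
    return h

probe = rng.normal(size=(1, 2))                     # one fixed input
batch_a = np.vstack([probe, rng.normal(size=(31, 2))])
batch_b = np.vstack([probe, rng.normal(size=(31, 2))])

rep_a = forward(batch_a, use_bn=True)[0]
rep_b = forward(batch_b, use_bn=True)[0]
```

With batch norm, `rep_a` and `rep_b` differ even though they describe the same input; without batch norm the two forward passes agree exactly, since each sample is processed independently.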

Appendix D Author Contributions

In the spirit of [21], we provide a summary of each author’s contributions.

  • First author formulated the hypothesis, conducted the experiments, and wrote the initial draft.

  • Second author prepared detailed technical notes on the main references, met frequently with the first author to advance the work, and critically revised the manuscript.

  • Third author originally conceived the key theoretical concept of Appendix B as well as some of the figures, and provided important technical suggestions and feedback.

  • Fourth author met with the first author to discuss the work and helped revise the manuscript.

  • Senior author critically revised several iterations of the manuscript, helped improve the presentation, recommended additional experiments, and sought outside feedback.