Deep learning has been successfully applied to several machine learning tasks, including visual object classification he2016identity ; Krizhevsky:2012wl , speech recognition Hinton:2012is , and natural language processing Collobert:2008kg ; jozefowicz2016exploring . Although the loss functions used to train deep neural networks are highly non-convex, it has been observed that gradient-based methods find good solutions in practice Lee:2016vd , and that all local minima attain similar loss values Choromanska:2015ui ; Dauphin:2014tn ; Kawaguchi:2016ub . However, the performance on test data, that is, generalizability, can differ significantly among these local minima. Despite the success of deep learning applications, we lack an understanding of generalizability, even though there has been progress Keskar:2017tz ; Zhang:2017te .
Understanding the generalizability of deep learning is not only an interesting research topic but also important from a practical point of view. For example, suppose that we are training a deep neural network using SGD and want to parallelize the process using multiple GPUs or nodes to accelerate the training. A well-known method for achieving this is synchronous SGD das2016distributed ; chen2016revisiting , which requires a large minibatch to effectively exploit parallel computation on multiple GPUs or nodes. However, it has been reported that models trained through synchronous SGD with large minibatches exhibit poor generalization Keskar:2017tz , and a method is required to resolve this problem.
In this study, we consider the generalizability of deep learning from the perspective of sensitivity to input perturbation. Intuitively, if a trained model is insensitive or sensitive to the perturbation of an input, then the model is confident or not confident about the output, respectively. As the performance on test data is important, models that are insensitive to the perturbation of test data are required. Note that adversarial training Szegedy:2014vw ; Goodfellow:2015tl is designed to achieve insensitivity to the perturbation of training data, and it is not always effective for achieving insensitivity to the perturbation of test data.
To obtain insensitivity to the perturbation of test data, we propose a simple and effective regularization method, referred to as spectral norm regularization. As the name suggests, spectral norm regularization prevents the weight matrices used in neural networks from having large spectral norms. Through this, even though test data are not known in advance, a trained model is ensured to exhibit slight sensitivity to the perturbation of test data.
Using several real-world datasets, we experimentally confirm that models trained using spectral norm regularization exhibit better generalizability than models trained using other baseline methods. It is claimed in Keskar:2017tz that the maximum eigenvalue of the Hessian predicts the generalizability of a trained model. However, we show that insensitivity to the perturbation of test data is a more important factor for predicting generalizability, which further motivates the use of spectral norm regularization. Finally, we show that spectral norm regularization effectively reduces the spectral norms of weight matrices.
2 Related Work
A conventional method of understanding the generalizability of a trained model is the notion of the flatness/sharpness of a local minimum Hochreiter:1997wf . A local minimum is (informally) referred to as flat if its loss value does not increase significantly when it is perturbed; otherwise, it is referred to as sharp. In general, the high sensitivity of a training function at a sharp local minimizer negatively affects the generalizability of the trained model. In Hochreiter:1997wf , this is explained in more detail through the minimum description length theory, which states that statistical models that require fewer bits to describe generalize better Rissanen:1983fr .
It is known that SGD with a large minibatch size leads to a model that does not generalize well lecun2012efficient . In Keskar:2017tz , this problem is studied based on the flatness/sharpness of the obtained local minima. They formulated a flat local minimum as a local minimum at which all eigenvalues of the Hessian are small (note that all eigenvalues are non-negative at a local minimum). Using a proxy measure, they experimentally showed that SGD with a smaller minibatch size tends to converge to a flatter minimum.
The notion of flat/sharp local minima considers the sensitivity of a loss function against the perturbation of model parameters. However, it is natural to consider the sensitivity of a loss function against the perturbation of input data, as we discuss in this paper. In Szegedy:2014vw , the perturbation of training data that increases the loss function the most is considered, and the resulting perturbed training data are referred to as adversarial examples. It is reported in Goodfellow:2015tl that training using adversarial examples improves test accuracy.
Recently, Zhang:2017te showed that the classical notions of Rademacher complexity and the VC dimension are not adequate for understanding the generalizability of deep neural networks.
3 Spectral Norm Regularization
In this section, we explain spectral norm regularization and how it reduces the sensitivity to test data perturbation.
3.1 General idea
We consider feed-forward neural networks as a simple example to explain the intuition behind spectral norm regularization. A feed-forward neural network can be represented as cascaded computations, $x^{\ell} = f^{\ell}(W^{\ell} x^{\ell-1} + b^{\ell})$ for $\ell = 1, \ldots, L$, for some $L \in \mathbb{N}$, where $x^{\ell-1} \in \mathbb{R}^{n_{\ell-1}}$ is the input feature of the $\ell$-th layer, $f^{\ell} : \mathbb{R}^{n_\ell} \to \mathbb{R}^{n_\ell}$ is a (non-linear) activation function, and $W^{\ell} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and $b^{\ell} \in \mathbb{R}^{n_\ell}$ are the layer-wise weight matrix and bias vector, respectively. For a set of parameters, $\Theta = \{W^{\ell}, b^{\ell}\}_{\ell=1}^{L}$, we denote by $f_\Theta$ the function defined as $f_\Theta(x^0) = x^L$. Given training data $\{(x_i, y_i)\}_{i=1}^{K}$, where $x_i \in \mathbb{R}^{n_0}$ and $y_i \in \mathbb{R}^{n_L}$, the loss function is defined as $\frac{1}{K}\sum_{i=1}^{K} L(f_\Theta(x_i), y_i)$, where $L$ is frequently selected to be the cross entropy for classification tasks and the squared $\ell_2$-distance for regression tasks. The model parameter to be learned is $\Theta$.
Let us consider how we can obtain a model that is insensitive to the perturbation of the input. Our goal is to obtain a model, $f_\Theta$, such that the $\ell_2$-norm of $f_\Theta(x + \xi) - f_\Theta(x)$ is small, where $x$ is an arbitrary vector and $\xi$ is a perturbation vector with a small $\ell_2$-norm. A key observation is that most practically used neural networks exhibit nonlinearity only because they use piecewise linear functions, such as ReLU, maxout Goodfellow:2013tf , and maxpooling Ranzato:2007eb , as activation functions. In such a case, the function $f_\Theta$ is a piecewise linear function. Hence, if we consider a small neighborhood of $x$, we can regard $f_\Theta$ as a linear function. In other words, we can represent it by an affine map, $x \mapsto W_{\Theta,x} x + b_{\Theta,x}$, using a matrix, $W_{\Theta,x}$, and a vector, $b_{\Theta,x}$, which depend on $\Theta$ and $x$. Then, for a small perturbation, $\xi$, we have

$$\frac{\|f_\Theta(x + \xi) - f_\Theta(x)\|_2}{\|\xi\|_2} = \frac{\|W_{\Theta,x}\,\xi\|_2}{\|\xi\|_2} \le \sigma(W_{\Theta,x}),$$

where $\sigma(W_{\Theta,x})$ is the spectral norm of $W_{\Theta,x}$. The spectral norm of a matrix $A$ is defined as

$$\sigma(A) = \max_{\xi \ne 0} \frac{\|A\xi\|_2}{\|\xi\|_2},$$

which corresponds to the largest singular value of $A$. Hence, the function $f_\Theta$ is insensitive to the perturbation of $x$ if the spectral norm of $W_{\Theta,x}$ is small.
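As a concrete numerical check of this definition, the following NumPy sketch computes the spectral norm as the largest singular value and verifies the Lipschitz bound above on random perturbations (the matrix and perturbations are illustrative examples of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

# The spectral norm is the largest singular value of A.
sigma = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(sigma, np.linalg.norm(A, ord=2))

# For any perturbation xi, ||A xi||_2 <= sigma(A) * ||xi||_2.
for _ in range(100):
    xi = rng.standard_normal(6)
    assert np.linalg.norm(A @ xi) <= sigma * np.linalg.norm(xi) + 1e-9
```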
The abovementioned argument suggests that the model parameter $\Theta$ should be trained so that the spectral norm of $W_{\Theta,x}$ is small for any $x$. To further investigate the property of $W_{\Theta,x}$, let us assume that each activation function, $f^\ell$, is an element-wise ReLU (the argument can be easily generalized to other piecewise linear functions). Note that, for a given vector, $x^{\ell-1}$, $f^\ell$ acts as a diagonal matrix, $D^{\ell}_{\Theta,x}$, where an element in the diagonal is equal to one if the corresponding element in $W^\ell x^{\ell-1} + b^\ell$ is positive; otherwise, it is equal to zero. Then, we can rewrite $W_{\Theta,x}$ as $W_{\Theta,x} = D^{L}_{\Theta,x} W^{L} D^{L-1}_{\Theta,x} W^{L-1} \cdots D^{1}_{\Theta,x} W^{1}$. Note that $\sigma(D^{\ell}_{\Theta,x}) \le 1$ for every $\ell$. Hence, we have

$$\sigma(W_{\Theta,x}) \le \prod_{\ell=1}^{L} \sigma(D^{\ell}_{\Theta,x})\,\sigma(W^{\ell}) \le \prod_{\ell=1}^{L} \sigma(W^{\ell}).$$

It follows that, to bound the spectral norm of $W_{\Theta,x}$, it suffices to bound the spectral norm of $W^{\ell}$ for each $\ell \in \{1, \ldots, L\}$. This motivates us to consider spectral norm regularization, which is described in the next section.
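This decomposition and the resulting bound can be verified numerically on a small random ReLU network; the layer widths and initialization below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [8, 16, 16, 4]                 # layer widths n_0, ..., n_L (illustrative)
Ws = [rng.standard_normal((m, n)) / np.sqrt(n)
      for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) * 0.1 for m in sizes[1:]]

x = rng.standard_normal(sizes[0])

# Forward pass with ReLU, recording the activation masks D^l.
h, Ds = x, []
for W, b in zip(Ws, bs):
    z = W @ h + b
    Ds.append(np.diag((z > 0).astype(float)))
    h = np.maximum(z, 0.0)

# Local linear map W_{Theta,x} = D^L W^L ... D^1 W^1.
J = np.eye(sizes[0])
for W, D in zip(Ws, Ds):
    J = D @ W @ J

sigma = lambda M: np.linalg.svd(M, compute_uv=False)[0]
bound = np.prod([sigma(W) for W in Ws])
# The local spectral norm is bounded by the product of layer-wise spectral norms.
assert sigma(J) <= bound + 1e-9
```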
3.2 Details of spectral norm regularization
In this subsection, we explain spectral norm regularization. The notations are the same as those used in Section 3.1. To bound the spectral norm of each weight matrix, $W^\ell$, we consider the following empirical risk minimization problem:

$$\min_{\Theta} \; \frac{1}{K}\sum_{i=1}^{K} L(f_\Theta(x_i), y_i) + \frac{\lambda}{2}\sum_{\ell=1}^{L} \sigma(W^{\ell})^2, \qquad (1)$$

where $\lambda \in \mathbb{R}_{+}$ is a regularization factor. We refer to the second term as the spectral norm regularizer. It decreases the spectral norms of the weight matrices.
When performing SGD, we need to calculate the gradient of the spectral norm regularizer. To this end, let us consider the gradient of $\sigma(W^\ell)^2/2$ for a particular $\ell$. Let $\sigma_1 = \sigma(W^\ell)$ and $\sigma_2$ be the first and second singular values of $W^\ell$, respectively. If $\sigma_1 > \sigma_2$, then the gradient of $\sigma(W^\ell)^2/2$ is $\sigma_1 u_1 v_1^\top$, where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. If $\sigma_1 = \sigma_2$, then $\sigma(W^\ell)^2/2$ is not differentiable. However, for practical purposes, we can assume that this case never occurs because numerical errors prevent $\sigma_1$ and $\sigma_2$ from being exactly equal.
As it is computationally expensive to compute $\sigma_1$, $u_1$, and $v_1$ exactly, we approximate them using the power iteration method. Starting with a randomly initialized $v \in \mathbb{R}^{n_{\ell-1}}$, we iteratively perform the following procedure a sufficient number of times: $u \leftarrow W^\ell v$ and $v \leftarrow (W^\ell)^\top u$, normalizing $u$ and $v$ after each update. Then, $v$, $u$, and $u^\top W^\ell v$ converge to $v_1$, $u_1$, and $\sigma_1$, respectively (if $\sigma_1 > \sigma_2$). To approximate $\sigma_1$, $u_1$, and $v_1$ in the next iteration of SGD, we can reuse $v$ as the initial vector. In our experiments, which are explained in Section 4, we performed only one iteration because it was adequate for obtaining a sufficiently good approximation. A pseudocode is provided in Algorithm 1.
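The update above can be sketched in NumPy as follows. This is our illustrative rendering of the power iteration step, not the paper's Algorithm 1 verbatim; the function name and signature are ours:

```python
import numpy as np

def spectral_norm_and_grad(W, v=None, n_iter=1, rng=None):
    """Approximate sigma_1(W) and the gradient of sigma(W)^2 / 2
    (which equals sigma_1 * u_1 v_1^T) by power iteration.
    Returns (sigma, grad, v) so that v can be reused as the starting
    vector in the next SGD step, as described in the text."""
    if rng is None:
        rng = np.random.default_rng(0)
    if v is None:
        v = rng.standard_normal(W.shape[1])
    for _ in range(max(1, n_iter)):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ (W @ v)              # Rayleigh-quotient estimate of sigma_1
    grad = sigma * np.outer(u, v)    # gradient of sigma(W)^2 / 2 w.r.t. W
    return sigma, grad, v

# With enough iterations, the estimate matches the exact largest singular value.
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 30))
sigma, grad, _ = spectral_norm_and_grad(W, n_iter=200, rng=rng)
assert np.isclose(sigma, np.linalg.svd(W, compute_uv=False)[0])
```

In training, one would call this with `n_iter=1` per SGD step and thread the returned `v` back in, as the text describes.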
We describe how to handle convolutions because they are widely used in recent applications of deep neural networks. Consider a convolutional layer with $a$ input channels, $b$ output channels, and a $k_w \times k_h$-sized kernel. This implies that the convolution has $a b k_w k_h$ parameters. Note that a value in an output channel is determined using $a k_w k_h$ values in the input channels. Hence, we align the parameters as a matrix of size $b \times a k_w k_h$ and apply the abovementioned power iteration method to the matrix to calculate its spectral norm and gradient.
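A minimal sketch of this reshaping, assuming a kernel stored in the common (out_channels, in_channels, kernel_height, kernel_width) memory layout (a convention of ours here, as in e.g. PyTorch; the paper does not prescribe one):

```python
import numpy as np

def conv_spectral_norm(kernel):
    """Spectral norm of a conv layer's parameters, aligning the kernel of
    shape (b, a, k_h, k_w) as a matrix of size b x (a * k_h * k_w),
    as described in the text."""
    b = kernel.shape[0]
    W = kernel.reshape(b, -1)
    return np.linalg.svd(W, compute_uv=False)[0]

rng = np.random.default_rng(0)
K = rng.standard_normal((64, 32, 3, 3))   # 32 in-channels, 64 out-channels, 3x3 kernel
s = conv_spectral_norm(K)
assert s > 0
```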
3.3 Comparison with other regularization methods
We now compare spectral norm regularization with other regularization techniques.
Weight decay, or Frobenius norm regularization, is a well-known regularization technique for deep learning. It considers the following problem:

$$\min_{\Theta} \; \frac{1}{K}\sum_{i=1}^{K} L(f_\Theta(x_i), y_i) + \frac{\lambda}{2}\sum_{\ell=1}^{L} \|W^{\ell}\|_F^2, \qquad (2)$$

where $\lambda \in \mathbb{R}_{+}$ is a regularization factor. We note that $\|W^\ell\|_F^2 = \sum_{i} \sigma_i(W^\ell)^2$, where $\sigma_i(W^\ell)$ is the $i$-th singular value of $W^\ell$. Hence, Frobenius norm regularization reduces the sum of squared singular values. Even though it can be effective for training models that are insensitive to input perturbation, a trained model may lose important information about the input because each trained weight matrix, $W^\ell$, acts as an operator that shrinks the input in all directions. In contrast, spectral norm regularization focuses only on the first singular value, and each trained weight matrix, $W^\ell$, does not shrink significantly in the directions orthogonal to the first right singular vector.
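The identity relating the two norms is easy to verify numerically (the random matrix below is an illustrative example of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 7))
svals = np.linalg.svd(W, compute_uv=False)

# The squared Frobenius norm is the sum of ALL squared singular values...
assert np.isclose(np.linalg.norm(W, 'fro')**2, np.sum(svals**2))
# ...whereas the spectral norm depends only on the largest one.
assert np.isclose(np.linalg.norm(W, 2), svals[0])
```

This is the precise sense in which weight decay penalizes every direction of the operator while spectral norm regularization penalizes only the dominant one.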
Adversarial training Goodfellow:2015tl considers the following problem:

$$\min_{\Theta} \; \frac{1-\alpha}{K}\sum_{i=1}^{K} L(f_\Theta(x_i), y_i) + \frac{\alpha}{K}\sum_{i=1}^{K} L(f_\Theta(x_i + \eta_i), y_i), \quad \text{where } \eta_i = \epsilon \,\mathrm{sign}\bigl(\nabla_x L(f_\Theta(x_i), y_i)\bigr), \qquad (3)$$

and $\alpha$ and $\epsilon$ are hyperparameters. It considers the perturbation toward the direction that increases the loss function the most. Hence, a trained model is insensitive to the adversarial perturbation of training data. In contrast, spectral norm regularization automatically trains a model that is insensitive to the perturbation of both training data and test data.
Another method of reducing the sensitivity of a model to input perturbation is to penalize the $\ell_2$-norm of the derivative of the output with respect to the input. Let us denote the Jacobian matrix of $f_\Theta(x)$ with respect to $x$ by $J_\Theta(x)$. The Jacobian regularization considers the following regularization term:

$$\frac{1}{K}\sum_{i=1}^{K} \|J_\Theta(x_i)\|_F^2.$$
The Jacobian regularization promotes the smoothness of a model against input perturbation. However, this regularization is impractical because calculating the derivative of the Jacobian with respect to the parameters is computationally expensive. To resolve this issue, Gu and Rigazio gu2014towards proposed an alternative method that regularizes layer-wise Jacobians:

$$\frac{1}{K}\sum_{i=1}^{K}\sum_{\ell=1}^{L} \|J^{\ell}_\Theta(x_i^{\ell-1})\|_F^2,$$

where $x_i^{\ell-1}$ is the input to the $\ell$-th layer calculated from $x_i$, and $J^{\ell}_\Theta$ is the Jacobian of the $\ell$-th layer with respect to its input. Note that, if we neglect the effect of the activation functions between layers, this regularization coincides with weight decay. Hence, we exclude this regularization from our experiments.
4 Experiments

In this section, we experimentally demonstrate the effectiveness of spectral norm regularization on classification tasks over other regularization techniques, and confirm that insensitivity to test data perturbation is an important factor for generalization.
All the training methods discussed here are based on stochastic gradient descent (SGD). We consider two regimes for the choice of the mini-batch size: in the small-batch regime, we set the mini-batch size to 64, and in the large-batch regime, we set it to 4096. In our experiments, we compared the following four problems:
Vanilla problem (vanilla): As a vanilla problem, we considered empirical risk minimization without any regularization, that is, , where is the cross entropy.
Weight decay (decay): We considered the problem (2), where $L$ is the cross entropy. The regularization factor $\lambda$ was selected through the grid search mentioned below.
Spectral norm regularization (spectral): We considered the problem (1), where $L$ is the cross entropy. The regularization factor $\lambda$ was selected through the grid search mentioned below.

Adversarial training (adversarial): We considered the adversarial training problem described in Section 3.3, where $L$ is the cross entropy.
We trained models using Nesterov's accelerated gradient descent Bengio:2013fn with momentum 0.9. We decreased the learning rate by a constant factor when half and three quarters of the training process had passed. We optimized the hyper-parameters through a grid search and selected those that showed reasonably good performance for every choice of neural network and mini-batch size.
In our experiments, we used the following four settings of models and datasets:

The VGG network (VGGNet) on the CIFAR-10 dataset.

The network in network (NIN) model on the CIFAR-100 dataset.

The densely connected convolutional network (DenseNet) Huang:2016wa having a depth of 40 on the CIFAR-100 dataset.

DenseNet having a depth of 22 on the STL-10 dataset Coates:2011wo . The depth was decreased because of memory consumption issues.
We preprocessed all the datasets using global contrast normalization. We further applied data augmentation with cropping and horizontal flipping on STL-10 because the number of training data samples is only 5,000, which is small considering the large mini-batch size of 4096. The learning rate of SGD was initialized to different values in the small-batch and large-batch regimes: one setting for the NIN and DenseNet models on the CIFAR-100 dataset, and another for the VGGNet and DenseNet models on the other datasets.
4.1 Accuracy and Generalization Gap
Table 1: Test accuracy (left columns) and generalization gap (right columns) of each method for each setting.
First, we look at the test accuracy obtained by each method, which is summarized in the left columns of Table 1. In the small-batch regime, decay and spectral show better test accuracies than the other two methods. In the large-batch regime, spectral clearly achieves the best test accuracy for every model. Although the test accuracy decreases as the mini-batch size increases, as reported in Keskar:2017tz , the decrease for spectral is less significant than for the other three methods.
Next, we look at the generalization gap. We define the generalization gap at a threshold $\tau$ as the minimum difference between the training and test accuracies over the epochs at which the test accuracy exceeds $\tau$. The generalization gap of each method is summarized in the right columns of Table 1. For each setting, we selected the threshold $\tau$ so that every method achieves a test accuracy that (slightly) exceeds it. For every setting except the DenseNet model on the STL-10 dataset, spectral clearly shows the smallest generalization gap, followed by decay, which validates the effectiveness of spectral.
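The definition of the generalization gap at a threshold can be expressed as a small helper function; the function name and the toy accuracy curves below are hypothetical, for illustration only:

```python
import numpy as np

def generalization_gap(train_acc, test_acc, tau):
    """Minimum (train - test) accuracy difference over the epochs at which
    the test accuracy exceeds the threshold tau (illustrative helper)."""
    train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
    mask = test_acc > tau
    return np.min(train_acc[mask] - test_acc[mask])

# Toy accuracy curves over four epochs (hypothetical numbers).
train = [0.60, 0.80, 0.95, 0.99]
test  = [0.55, 0.70, 0.74, 0.75]
# Epochs 3 and 4 exceed tau=0.72; the smaller of the two gaps is 0.95 - 0.74.
assert np.isclose(generalization_gap(train, test, tau=0.72), 0.21)
```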
Figure 1 shows the training curves of the DenseNet model on the CIFAR-100 dataset. The results for the other settings are given in Appendix A. As we can observe, in both the small-batch and large-batch regimes, spectral shows the smallest generalization gap. The generalization gap of decay increases as the mini-batch size increases, whereas that of spectral does not increase significantly. Investigating the reason behind this phenomenon is interesting future work.

The choice of the mini-batch size and the training method is not important for obtaining a model with a good training accuracy; all of them exceed 95%. However, it is important for obtaining a model with a good test accuracy or, equivalently, a small generalization gap. In the subsequent sections, we investigate which property of a trained model determines its generalization gap.
To summarize, spectral consistently achieves a small generalization gap and shows the best test accuracy, especially in the large-batch regime.
4.2 Sensitivity to the perturbation of the input
To understand the relation between the sensitivity to input perturbation and the generalization gap of a trained model, we look at the gradient of the loss function, defined with the training data or the test data, with respect to the input. Figure 2 shows the transition of the $\ell_2$-norm of this gradient when training the DenseNet model on the CIFAR-100 dataset. The results for other settings are given in Appendix B.

We can observe in Figure 2(b) that the $\ell_2$-norm of the gradient with respect to the test data is well correlated with the generalization gap. In particular, the $\ell_2$-norm of the gradient gradually increases as training proceeds, which matches the behavior of the generalization gap, which also increases as training proceeds.

On the contrary, as shown in Figure 2(a), although the $\ell_2$-norm of the gradient with respect to the training data gradually decreases as training proceeds, the generalization gap actually increases. Hence, the $\ell_2$-norm of the gradient with respect to the training data does not predict the generalization gap well.

These results motivate us to reduce the $\ell_2$-norm of the gradient with respect to the test data to obtain a smaller generalization gap. As we do not know the test data a priori, spectral norm regularization is a reasonable method for achieving this goal because it reduces the gradient at any point.
The superiority of spectral over decay can be elucidated from this experiment. To achieve a better test accuracy, we must make the training accuracy high and the generalization gap small. To achieve the latter, we must penalize the $\ell_2$-norm of the gradient with respect to the test data. To this end, decay suppresses all the weights, which decreases model complexity; hence, we cannot fit the model to the training data well. In contrast, spectral only suppresses the spectral norms; hence, we can achieve a greater model complexity than decay and fit the model to the training data better.
4.3 Maximum eigenvalue of the Hessian with respect to the model parameters
In Keskar:2017tz , it is claimed that the maximum eigenvalue of the Hessian of the loss function defined with the training data predicts the generalization gap well. To examine this claim, we computed the maximum eigenvalue for the DenseNet model trained on the CIFAR-100 dataset, shown in Figure 3. As it is computationally expensive to compute the Hessian explicitly, we approximated its maximum eigenvalue using the power iteration method, exploiting the fact that the Hessian-vector product can be calculated without explicitly forming the Hessian Martens:2010vo . We also computed the maximum eigenvalue of the Hessian of the loss function defined with the test data.
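A minimal sketch of this approximation on a toy quadratic loss, where the Hessian-vector product is obtained by a finite difference of gradients rather than by forming the Hessian (the function and setup are our illustration; the paper's models are far larger, and autodiff would supply exact Hessian-vector products there):

```python
import numpy as np

def max_eig_via_hvp(grad_fn, dim, n_iter=100, eps=1e-5, rng=None):
    """Largest Hessian eigenvalue by power iteration, using only
    Hessian-vector products approximated by finite differences:
    H v ~ (g(theta + eps * v) - g(theta)) / eps."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = rng.standard_normal(dim)   # evaluation point (unused by a linear grad_fn)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        Hv = (grad_fn(theta + eps * v) - grad_fn(theta)) / eps
        lam = v @ Hv                   # Rayleigh-quotient eigenvalue estimate
        v = Hv / np.linalg.norm(Hv)
    return lam

# Sanity check on the quadratic loss 0.5 * theta^T H theta, whose gradient is H theta.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T                            # symmetric PSD Hessian with known spectrum
lam = max_eig_via_hvp(lambda t: H @ t, dim=6, n_iter=200, rng=rng)
assert np.isclose(lam, np.linalg.eigvalsh(H)[-1], rtol=1e-3)
```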
We can observe that, for vanilla, larger eigenvalues (on both the training and test data) are obtained when the mini-batch size is increased, which confirms the claim of Keskar:2017tz . However, the models trained with regularization tend to have larger eigenvalues, although they generalize better. In particular, the models trained by spectral have the largest eigenvalues, although they have the best generalizability, as we have seen in Section 4.1.

This phenomenon can be understood as follows: if we train a model without regularization, the Frobenius and spectral norms of the weight matrices become large, and a small perturbation of the parameters changes these norms only slightly in relative terms. In contrast, if we train a model with regularization, these norms are kept small, and a small perturbation of the parameters can change them significantly in relative terms, which may cause a significant change in the output and, hence, in the loss.

To summarize, this experiment indicates that the maximum eigenvalue of the Hessian of the loss function is not an adequate measure for predicting generalizability.
4.4 Singular values of weight matrices
Finally, we look at the singular values of the weight matrices in the models trained by each method. Figure 4 shows the singular values of a weight matrix taken from the DenseNet model trained on the CIFAR-100 dataset. The matrix was selected arbitrarily because all matrices showed similar spectra.
We can observe that the spectrum of vanilla is highly skewed, and that adversarial and decay shrink the spectrum while maintaining its skewed shape. In contrast, the spectrum of spectral is flat. This behavior is as expected because spectral norm regularization reduces the largest singular value. Because the maximum singular value obtained by spectral is low, the trained model is less sensitive to the perturbation of the input.
5 Conclusions

In this work, we hypothesized that a high sensitivity to the perturbation of input data degrades the performance on test data. To reduce the sensitivity to the perturbation of test data, we proposed the spectral norm regularization method and experimentally confirmed that it exhibits better generalizability than other baseline methods. Experimental comparison with other methods indicated that insensitivity to the perturbation of test data plays a crucial role in determining generalizability.
There are several interesting future directions to pursue. It is known that weight decay can be seen as the regularization term in MAP estimation derived from a Gaussian prior on the model parameters. Is it possible to understand spectral norm regularization as being derived from a prior? We also need a theoretical understanding of the effect of spectral norm regularization on generalization. It is known that, in some ideal cases, weight decay improves generalizability by preventing neural networks from fitting noise Krogh:1991uo . Can we extend this argument to spectral norm regularization?
The authors thank Takuya Akiba and Seiya Tokui for helpful discussions.
- (1) M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In ICML, pages 1120–1128, 2016.
- (2) Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP, pages 8624–8628, 2013.
- (3) J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. 2016. arXiv:1604.00981.
- (4) A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, pages 192–204, 2015.
- (5) A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011.
- (6) R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pages 160–167, 2008.
- (7) D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey. Distributed deep learning using synchronous stochastic gradient descent. 2016.
- (8) Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pages 2933–2941, 2014.
- (9) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
- (10) I. J. Goodfellow, D. Warde-Farley, M. Mirza, and A. C. Courville. Maxout networks. In ICML, pages 1319–1327, 2013.
- (11) S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
- (12) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
- (13) G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- (14) S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- (15) G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. 2016. arXiv:1608.06993.
- (16) R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. 2016. arXiv:1602.02410.
- (17) K. Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586–594, 2016.
- (18) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning - generalization gap and sharp minima. In ICLR, 2017.
- (19) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
- (20) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, pages 1097–1105, 2012.
- (21) A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In NIPS, pages 950–957, 1991.
- (22) Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
- (23) J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In COLT, pages 1246–1257, 2016.
- (24) M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
- (25) J. Martens. Deep learning via hessian-free optimization. In ICML, pages 735–742, 2010.
- (26) M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, pages 1–8, 2007.
- (27) J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of statistics, 11(2):416–431, 1983.
- (28) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014. arXiv:1409.1556v6.
- (29) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
- (30) S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In NIPS, pages 4880–4888, 2016.
- (31) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Appendix A Accuracy
The training curves for the remaining settings, that is, the VGGNet, NIN, and DenseNet (STL-10) models, are shown in Figure 5. We can observe that, in every setting, spectral shows the smallest generalization gap or the best test accuracy, which demonstrates that spectral can effectively reduce the generalization gap without significantly suppressing model complexity.
Appendix B Sensitivity to the perturbation of the input
Figure 6 shows the transition of the $\ell_2$-norm of the gradient of the loss function, defined with the training data or test data, with respect to the input. For every setting, the $\ell_2$-norm of the gradient with respect to the test data is well correlated with the generalization gap. On the contrary, for the VGGNet and NIN models, the $\ell_2$-norm of the gradient with respect to the training data does not predict the generalization gap well.