Deep neural networks (DNNs) typically contain far more trainable parameters than training samples, which would seem to invite poor generalization. In practice, however, they usually exhibit remarkably small generalization gaps. Traditional generalization theories such as VC dimension (Vapnik and Chervonenkis, 1991) or Rademacher complexity (Bartlett and Mendelson, 2002) cannot explain this behavior. Extensive research has focused on the generalization ability of DNNs (Neyshabur et al., 2017; Arora et al., 2018; Keskar et al., 2016; Dinh et al., 2017; Hoffer et al., 2017; Novak et al., 2018; Dziugaite and Roy, 2017; Jakubovitz et al., 2019; Kawaguchi et al., 2017; Advani and Saxe, 2017).
Unlike that of shallow models such as logistic regression or support vector machines, the global minimum of high-dimensional and non-convex DNNs cannot be found analytically, but can only be approximated by gradient descent and its variants (Zeiler, 2012; Kingma and Ba, 2014; Graves, 2013). Previous work (Zhang et al., 2016; Hardt et al., 2015; Dziugaite and Roy, 2017) suggests that the generalization ability of DNNs is closely related to gradient descent optimization. For example, Hardt et al. (2015) claim that any model trained with stochastic gradient descent (SGD) for a reasonable number of epochs would exhibit small generalization error. Their analysis is based on the smoothness of the loss function. In this work, we attempt to understand the generalization behavior of DNNs through the gradient signal to noise ratio (GSNR) of model parameters and reveal how GSNR affects the training dynamics of gradient descent.
The GSNR of a parameter is defined as the ratio between its gradient’s squared mean and variance over the data distribution. Previous work has used GSNR for theoretical analysis of deep learning. For example, Rainforth et al. (2018) used GSNR to analyze variational bounds in unsupervised DNNs such as the variational auto-encoder (VAE). Here we focus on analyzing the relation between GSNR and the generalization gap.
Intuitively, GSNR measures the similarity of a parameter’s gradients among different training samples. Large GSNR implies that most training samples agree on the optimization direction of this parameter, thus the parameter is more likely to be associated with a meaningful “pattern” and we assume its update could lead to better generalization. In this work, we prove that GSNR is strongly related to generalization performance: larger GSNR implies better generalization.
To reveal the mechanism behind DNNs’ good generalization ability, we show that the gradient descent optimization dynamics of DNNs naturally lead to large GSNR of model parameters and therefore good generalization. Furthermore, we give a complete analysis and a detailed interpretation of this phenomenon. We believe this is probably the key to DNNs’ remarkable generalization ability.
2 Larger GSNR Leads to Better Generalization
In this section, we establish a quantitative relation between the GSNR of model parameters and generalization gap, showing that larger GSNR during training leads to better generalization.
2.1 Gradients Signal to Noise Ratio
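Consider a data distribution $\mathcal{Z}$, from which each sample $(x, y)$ is drawn; a model $\hat{y} = f(x, \theta)$ parameterized by $\theta$; and a loss function $L$.
The parameters’ gradient w.r.t. $L$ and sample $(x_i, y_i)$ is denoted by
$$g(x_i, y_i, \theta) := \frac{\partial L(y_i, f(x_i, \theta))}{\partial \theta},$$
whose $j$-th element is $g_j(x_i, y_i, \theta)$; we also write $g_i(\theta)$ for short. Note that throughout this paper we always use $i$ to index data examples and $j$ to index model parameters.
Given the data distribution $\mathcal{Z}$, we have the (sample-wise) mean and variance of $g(x, y, \theta)$. We denote them as $\tilde{g}(\theta) = \mathrm{E}_{(x,y)\sim\mathcal{Z}}\big(g(x, y, \theta)\big)$ and $\rho^2(\theta) = \mathrm{Var}_{(x,y)\sim\mathcal{Z}}\big(g(x, y, \theta)\big)$, respectively.
The gradient signal to noise ratio (GSNR) of one model parameter $\theta_j$ is defined as:
$$r(\theta_j) := \frac{\tilde{g}^2(\theta_j)}{\rho^2(\theta_j)}.$$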
At a particular point of the parameter space, GSNR measures the consistency of a parameter’s gradients across different data samples. Figure 1 intuitively shows that if GSNR is large, the per-sample gradient vectors tend to point in a similar direction, whereas if GSNR is small, the gradient vectors are scattered.
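The definition above (squared gradient mean divided by gradient variance) can be estimated from a matrix of per-sample gradients. The following is a minimal NumPy sketch, not the authors' implementation; the function name `gsnr`, the array layout, and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """Estimate the GSNR of every parameter.

    per_sample_grads has shape (num_samples, num_params): row i holds the
    gradient of the loss on sample i with respect to each parameter.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    mean = g.mean(axis=0)            # sample estimate of the gradient mean
    var = g.var(axis=0)              # sample estimate of the gradient variance
    return mean ** 2 / (var + eps)   # eps is an illustrative numerical guard

# Parameter 0: all samples agree on the direction. Parameter 1: they cancel out.
grads = np.array([[1.0, 1.0],
                  [1.2, -1.0],
                  [0.8, 1.0],
                  [1.0, -1.0]])
r = gsnr(grads)
```

In this toy input the first parameter receives a large GSNR because its gradients are consistent across samples, while the second receives a GSNR of zero because its gradients average out.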
2.2 One-Step Generalization Ratio
In this section we introduce a new concept to help measure the generalization performance during gradient descent optimization, which we call the one-step generalization ratio (OSGR). Consider a training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \sim \mathcal{Z}^n$ with $n$ samples drawn from $\mathcal{Z}$, and a test set $D' = \{(x'_1, y'_1), \dots, (x'_{n'}, y'_{n'})\} \sim \mathcal{Z}^{n'}$. In practice we use the loss on $D'$ to measure generalization. For simplicity, we assume the sizes of the training and test datasets are equal, i.e. $n = n'$. We denote the empirical training and test loss as:
$$L[D] = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta)), \qquad L[D'] = \frac{1}{n'}\sum_{i=1}^{n'} L(y'_i, f(x'_i, \theta)),$$
respectively. Then the empirical generalization gap is given by $L[D'] - L[D]$.
In gradient descent optimization, both the training and test loss would decrease step by step. We use $\Delta L[D]$ and $\Delta L[D']$ to denote the one-step training and test loss decrease during training, respectively. Let’s consider the ratio between the expectations of $\Delta L[D']$ and $\Delta L[D]$ of one single training step, which we denote as $R(\mathcal{Z}, n)$:
$$R(\mathcal{Z}, n) := \frac{\mathrm{E}_{D, D' \sim \mathcal{Z}^n}(\Delta L[D'])}{\mathrm{E}_{D \sim \mathcal{Z}^n}(\Delta L[D])}.$$
Note that this ratio also depends on the current model parameters $\theta$ and the learning rate $\lambda$. We do not include them in the above notation as we will not explicitly model these dependencies, but rather try to quantitatively characterize $R(\mathcal{Z}, n)$ for very small $\lambda$ and for $\theta$ at the early stage of training (satisfying Assumption 2.3.1).
Also note that the expectation of $\Delta L[D']$ is over both $D$ and $D'$. This is because the optimization step is performed on $D$. We refer to $R(\mathcal{Z}, n)$ as the OSGR of gradient descent optimization. Statistically the training loss decreases faster than the test loss and $0 < R(\mathcal{Z}, n) < 1$ (Middle panel of Figure 2), which usually results in a non-zero generalization gap at the end of training. If $R(\mathcal{Z}, n)$ is large ($R(\mathcal{Z}, n) \approx 1$) in the whole training process (Right panel of Figure 2), the generalization gap would be small when training completes, implying good generalization ability of the model. If $R(\mathcal{Z}, n)$ is small ($R(\mathcal{Z}, n) \approx 0$), the test loss will not decrease while the training loss drops normally (Left panel of Figure 2), corresponding to a large generalization gap.
2.3 Relation between GSNR and OSGR
In this section, we derive a relation between the OSGR during training and the GSNR of model parameters. To our knowledge, this relation shows for the first time that the sample-wise gradient distribution of parameters is related to the generalization performance of gradient descent optimization.
In gradient descent optimization, we take the average gradient over the training set $D$, which we denote as $g_D(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i, \theta)$. Note that we have used $g_i(\theta)$ to denote the gradient evaluated on one data sample and $\tilde{g}(\theta)$ to denote its expectation over the entire data distribution. Similarly we define $g_{D'}(\theta)$ to be the average gradient over the test set $D'$.
Both the training and test datasets are randomly generated from the same distribution $\mathcal{Z}^n$, so we can treat $g_D(\theta)$ and $g_{D'}(\theta)$ as random variables. At the beginning of the optimization process, $\theta$ is randomly initialized and thus independent of $D$, so $g_D(\theta)$ and $g_{D'}(\theta)$ obey the same distribution. After a period of training, the model parameters begin to fit the training dataset and become a function of $D$, i.e. $\theta = \theta(D)$; therefore the distributions of $g_D(\theta)$ and $g_{D'}(\theta)$ become different. However, we choose not to model this dependency and make the following assumption for our analysis:
Assumption 2.3.1 (Non-overfitting limit approximation)
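The average gradients over the training dataset and test dataset, $g_D(\theta)$ and $g_{D'}(\theta)$, obey the same distribution.
Obviously the mean of $g_D(\theta)$ and $g_{D'}(\theta)$ is just the mean gradient over the data distribution, $\tilde{g}(\theta)$:
$$\mathrm{E}_{D \sim \mathcal{Z}^n}[g_D(\theta)] = \mathrm{E}_{D' \sim \mathcal{Z}^n}[g_{D'}(\theta)] = \tilde{g}(\theta).$$
We denote their variance as $\sigma^2(\theta)$, i.e.
$$\sigma^2(\theta) := \mathrm{Var}_{D \sim \mathcal{Z}^n}[g_D(\theta)].$$
It is straightforward to show that:
$$\sigma^2(\theta) = \frac{1}{n}\rho^2(\theta),$$
where $\sigma^2(\theta)$ is the variance of the average gradient over a dataset of size $n$, and $\rho^2(\theta)$ is the variance of the gradient of a single data sample.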
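The statement that the variance of the average gradient over $n$ i.i.d. samples equals the single-sample gradient variance divided by $n$ can be checked numerically. The sketch below is an illustration only: it replaces real per-sample gradients with Gaussian draws of known variance, and the names `rho2`, `sigma2_empirical` and `sigma2_predicted` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_datasets = 25, 20000

# Stand-in for the per-sample gradient of one parameter:
# mean 0.5 and single-sample variance rho2 = 4.
rho2 = 4.0
samples = rng.normal(loc=0.5, scale=np.sqrt(rho2), size=(num_datasets, n))

dataset_means = samples.mean(axis=1)      # one average gradient per dataset
sigma2_empirical = dataset_means.var()    # variance across datasets
sigma2_predicted = rho2 / n               # the 1/n scaling stated in the text
```

With these settings the two quantities agree up to Monte Carlo noise, which shrinks as `num_datasets` grows.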
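In one gradient descent step, the model parameter is updated by $\Delta\theta = -\lambda g_D(\theta)$, where $\lambda$ is the learning rate. If $\lambda$ is small enough, the one-step training and test loss decrease can be approximated by
$$\Delta L[D] \approx -\Delta\theta \cdot \frac{\partial L[D]}{\partial \theta} = \lambda\, g_D(\theta) \cdot g_D(\theta),$$
$$\Delta L[D'] \approx -\Delta\theta \cdot \frac{\partial L[D']}{\partial \theta} = \lambda\, g_D(\theta) \cdot g_{D'}(\theta).$$
Usually there are some differences between the directions of $g_D(\theta)$ and $g_{D'}(\theta)$, so statistically $\Delta L[D]$ tends to be larger than $\Delta L[D']$ and the generalization gap would increase during training. When $\lambda \to 0$, in one single training step the empirical generalization gap increases by $\Delta L[D] - \Delta L[D']$; for simplicity we denote this quantity as $\nabla$:
$$\nabla := \Delta L[D] - \Delta L[D'] \approx \lambda\, g_D \cdot g_D - \lambda\, g_D \cdot g_{D'} = \lambda\,(\tilde{g} + \epsilon) \cdot (\epsilon - \epsilon').$$
Here we replaced the random variables by $g_D = \tilde{g} + \epsilon$ and $g_{D'} = \tilde{g} + \epsilon'$, where $\epsilon$ and $\epsilon'$ are random variables with zero mean and variance $\sigma^2(\theta)$. Since $\mathrm{E}(\epsilon) = \mathrm{E}(\epsilon') = 0$, and $\epsilon$ and $\epsilon'$ are independent, the expectation of $\nabla$ is
$$\mathrm{E}_{D, D'}(\nabla) = \lambda\, \mathrm{E}(\epsilon \cdot \epsilon) + O(\lambda^2) = \lambda \sum_j \sigma^2(\theta_j) + O(\lambda^2),$$
where $\sigma^2(\theta_j)$ is the variance of the average gradient of the parameter $\theta_j$.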
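The first-order approximations used in this step (training loss decrease close to the learning rate times the squared norm of the training gradient, and test loss decrease close to the learning rate times the inner product of the training and test gradients) can be sanity-checked on a small problem. The sketch below assumes a least-squares linear model, which is not the setting of the experiments in this paper; `make_set`, `loss`, `grad` and all sizes are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 50, 3, 1e-4
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights

def make_set():
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def loss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

(X_tr, y_tr), (X_te, y_te) = make_set(), make_set()
theta = np.zeros(d)
g_tr, g_te = grad(theta, X_tr, y_tr), grad(theta, X_te, y_te)
theta_new = theta - lr * g_tr            # one gradient descent step on the training set

# Exact one-step loss decreases versus their first-order approximations.
dL_train = loss(theta, X_tr, y_tr) - loss(theta_new, X_tr, y_tr)
dL_test = loss(theta, X_te, y_te) - loss(theta_new, X_te, y_te)
approx_train = lr * g_tr @ g_tr
approx_test = lr * g_tr @ g_te
```

For a quadratic loss the discrepancy is exactly the second-order term, which is proportional to the square of the learning rate and so becomes negligible as the learning rate shrinks.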
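For simplicity, when it involves a single model parameter $\theta_j$, we will use only a subscript $j$ instead of the full notation. For example, we use $\sigma_j^2$, $r_j$, and $g_{D,j}$ to denote $\sigma^2(\theta_j)$, $r(\theta_j)$, and $g_D(\theta_j)$ respectively.
Consider the expectation of $\Delta L[D]$ and $\Delta L[D']$ when $\lambda \to 0$:
$$\mathrm{E}_D(\Delta L[D]) \approx \lambda\, \mathrm{E}_D(g_D \cdot g_D) = \lambda \sum_j \mathrm{E}_D(g_{D,j}^2),$$
$$\mathrm{E}_{D,D'}(\Delta L[D']) = \mathrm{E}_D(\Delta L[D]) - \mathrm{E}_{D,D'}(\nabla) \approx \lambda \sum_j \left(\mathrm{E}_D(g_{D,j}^2) - \sigma_j^2\right).$$
Substituting these into the definition of the OSGR gives
$$R(\mathcal{Z}, n) = 1 - \frac{\sum_j \sigma_j^2}{\sum_j \mathrm{E}_D(g_{D,j}^2)}. \tag{19}$$
Although we derived eq. (19) from simplified assumptions, we can empirically verify it by estimating the two sides of the equation on real data. We will elaborate on this estimation method in section 2.4.
We can rewrite eq. (19) as:
$$R(\mathcal{Z}, n) = 1 - \sum_j W_j \frac{\sigma_j^2}{\mathrm{E}_D(g_{D,j}^2)} = 1 - \frac{1}{n}\sum_j W_j \frac{1}{r_j + \frac{1}{n}}, \qquad W_j := \frac{\mathrm{E}_D(g_{D,j}^2)}{\sum_{j'} \mathrm{E}_D(g_{D,j'}^2)}, \quad \sum_j W_j = 1,$$
where we used $\mathrm{E}_D(g_{D,j}^2) = \tilde{g}_j^2 + \sigma_j^2$, $\sigma_j^2 = \rho_j^2 / n$ and $r_j = \tilde{g}_j^2 / \rho_j^2$.
We define $\Delta L_j[D]$ to be the training loss decrease caused by updating $\theta_j$. We can show that when $\lambda$ is very small, $\Delta L_j[D] = \lambda g_{D,j}^2 + O(\lambda^2)$. Therefore when $\lambda \to 0$, we have
$$R(\mathcal{Z}, n) = 1 - \frac{1}{n}\sum_j \frac{\mathrm{E}_D(\Delta L_j[D])}{\mathrm{E}_D(\Delta L[D])} \cdot \frac{1}{r_j + \frac{1}{n}}. \tag{22}$$
Eq. (22) shows that the GSNR plays a crucial role in the model’s generalization ability: the one-step generalization ratio in gradient descent equals one minus the weighted average of $1/(r_j + 1/n)$ over all model parameters, divided by $n$. The weight is proportional to the expected training loss decrease resulting from updating that parameter. This implies that larger GSNR of model parameters during training leads to smaller generalization gap growth and thus better generalization performance of the trained model. Also note that when $n \to \infty$, we have $R(\mathcal{Z}, n) \to 1$, meaning that training on more data helps generalization.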
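The equivalence between the two ways of writing the OSGR can be checked numerically. The sketch below assumes that eq. (19) has the form "one minus the summed variance of the average gradient, divided by the summed second moment of the average gradient" and that the rewritten form is "one minus $1/n$ times the weighted average of $1/(r_j + 1/n)$", as described in the text; the function names and the example values of the gradient mean and variance are hypothetical.

```python
import numpy as np

def osgr_from_moments(g_mean, rho2, n):
    """Assumed eq. (19) form: 1 - sum_j sigma_j^2 / sum_j E[g_{D,j}^2]."""
    sigma2 = rho2 / n                          # variance of the average gradient
    return 1.0 - sigma2.sum() / (g_mean ** 2 + sigma2).sum()

def osgr_from_gsnr(g_mean, rho2, n):
    """Weighted-average form: 1 - (1/n) * sum_j W_j / (r_j + 1/n)."""
    r = g_mean ** 2 / rho2                     # GSNR of each parameter
    second_moment = g_mean ** 2 + rho2 / n     # E[g_{D,j}^2]
    w = second_moment / second_moment.sum()    # weights that sum to one
    return 1.0 - np.sum(w / (r + 1.0 / n)) / n

# Hypothetical per-parameter gradient means and single-sample variances.
g_mean = np.array([0.5, -0.1, 0.02])
rho2 = np.array([1.0, 0.5, 2.0])
r_small_n = osgr_from_gsnr(g_mean, rho2, 10)
r_large_n = osgr_from_gsnr(g_mean, rho2, 1000)
```

Under these assumptions the two functions return the same value for any $n$, and the ratio moves towards one as $n$ grows, consistent with the remark that more training data helps generalization.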
2.4 Experimental verification of the relation between GSNR and OSGR
The relation between GSNR and OSGR, i.e. eq. (19) or (22), can be empirically verified using any dataset if: (1) the dataset includes enough samples to construct many training sets and a large enough test set, so that we can reliably estimate the per-parameter gradient statistics and the OSGR; (2) the learning rate is small enough; (3) the model is in the early training stage of gradient descent.
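To empirically verify eq. (19), we show how to estimate its left and right hand sides, i.e. OSGR by definition and OSGR as a function of GSNR. Suppose we have $M$ training sets, each with size $n$, and a test set of size $n'$. We initialize a model and train it separately on the $M$ training sets and test it with the same test set. For the $t$-th training iteration, we denote the training loss and test loss of the model trained on the $k$-th training dataset as $L_t^{(k)}$ and $L_t'^{(k)}$, respectively. Then the left hand side, i.e. OSGR by definition, of the $t$-th iteration can be estimated by
$$R_t(\mathcal{Z}, n) \approx \frac{\sum_{k=1}^{M} \left(L_t'^{(k)} - L_{t+1}'^{(k)}\right)}{\sum_{k=1}^{M} \left(L_t^{(k)} - L_{t+1}^{(k)}\right)}.$$
For the model trained on the $k$-th training set, we can compute the $t$-th step average gradient and sample-wise gradient variance of $\theta_j$ on the corresponding training set, denoted as $g_{j,t}^{(k)}$ and $\rho_{j,t}^{2\,(k)}$, respectively. Therefore the right hand side of eq. (19) can be estimated by
$$1 - \frac{\sum_j \sigma_{j,t}^2}{\sum_j \mathrm{E}_D(g_{D,j,t}^2)}, \qquad \sigma_{j,t}^2 \approx \frac{1}{Mn}\sum_{k=1}^{M} \rho_{j,t}^{2\,(k)}, \qquad \mathrm{E}_D(g_{D,j,t}^2) \approx \frac{1}{M}\sum_{k=1}^{M} \left(g_{j,t}^{(k)}\right)^2.$$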
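The estimation procedure described above can be illustrated on a toy problem. The sketch below is not the MNIST experiment of this paper: it assumes a least-squares linear model at a data-independent initial point, a single gradient descent step, and hypothetical sizes (`M`, `n`, `d`, `n_test`). The left hand side is the ratio of summed test and training loss decreases; the right hand side uses the assumed form of eq. (19), with the variance of the average gradient estimated as the sample-wise variance divided by `n`.

```python
import numpy as np

def osgr_both_sides(M=4000, n=20, d=5, n_test=50000, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    w_true = np.ones(d)                       # hypothetical ground-truth weights
    theta = np.zeros(d)                       # initial point, independent of the data

    # M training sets of size n and one large shared test set.
    X = rng.normal(size=(M, n, d))
    y = X @ w_true + rng.normal(size=(M, n))
    X_te = rng.normal(size=(n_test, d))
    y_te = X_te @ w_true + rng.normal(size=n_test)

    # Per-sample gradients of 0.5 * (x . theta - y)^2, shape (M, n, d).
    resid = X @ theta - y
    per_sample = resid[..., None] * X
    g_D = per_sample.mean(axis=1)             # average gradient of each training set
    rho2 = per_sample.var(axis=1, ddof=1)     # sample-wise variance in each training set

    theta_new = theta - lr * g_D              # one gradient descent step per set, (M, d)

    def train_loss(th):
        pred = np.einsum("knd,kd->kn", X, th)
        return 0.5 * np.mean((pred - y) ** 2, axis=1)

    # The test loss is quadratic in theta, so evaluate it through its moments.
    A = X_te.T @ X_te / n_test
    b = X_te.T @ y_te / n_test
    c = np.mean(y_te ** 2)

    def test_loss(th):
        return 0.5 * (np.einsum("kd,de,ke->k", th, A, th) - 2.0 * th @ b + c)

    theta_old = np.tile(theta, (M, 1))
    d_train = train_loss(theta_old) - train_loss(theta_new)
    d_test = test_loss(theta_old) - test_loss(theta_new)
    lhs = d_test.sum() / d_train.sum()        # OSGR by definition

    sigma2 = rho2.mean(axis=0) / n            # variance of the average gradient
    second_moment = (g_D ** 2).mean(axis=0)   # estimate of E[g_{D,j}^2]
    rhs = 1.0 - sigma2.sum() / second_moment.sum()
    return lhs, rhs

lhs, rhs = osgr_both_sides()
```

Because the initial point does not depend on any training set, the non-overfitting assumption holds exactly in this toy setting, so the two estimates should agree up to Monte Carlo noise and second-order terms in the learning rate.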
We performed the above estimations on MNIST with a simple CNN consisting of 2 Conv-Relu-MaxPooling blocks and 2 fully-connected layers. First, to estimate eq. (24) with , we randomly sample 10 training sets with size and a test set with size 10,000. To cover different conditions, we (1) choose , respectively; (2) inject noise by randomly changing the labels with probability ; (3) change the model structure by varying the number of channels in the layers, . See Appendix A for more details of the setup. We use gradient descent training (not SGD), with a small learning rate of . The left and right hand sides of eq. (19) at different epochs are shown in Figure 3, where each point represents one specific choice of the above settings.
At the beginning of training, the data points are closely distributed along the dashed line corresponding to LHS=RHS. This shows that eq. (19) fits quite well under a variety of different settings. As training proceeds, the points become more scattered as the non-overfitting limit approximation no longer holds, but the correlation between the LHS and RHS remains high even when the training converges (at epoch 2,500). We also conducted the same experiment on CIFAR10 (Appendix A.2) and on a toy dataset (Appendix A.3) and observed the same behavior.
The empirical evidence together with our previous derivation of eq. (19) clearly show the relation between GSNR and OSGR and its implication in the model’s generalization ability.
3 Training dynamics of DNNs naturally leads to large GSNR
In this section, we analyze and explain one interesting phenomenon: the parameters’ GSNR of DNNs rises in the early stages of training, whereas the GSNR of shallow models such as logistic regression or support vector machines declines during the entire training process. This difference gives rise to GSNR’s large practical values during training, which in turn is associated with good generalization. We analyze the dynamics behind this phenomenon both experimentally and theoretically.
3.1 GSNR behavior of DNNs training
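For shallow models, the GSNR of parameters decreases throughout the training process because gradients become small as learning converges. This is not the case for DNNs. We trained DNNs on the CIFAR datasets and computed the GSNR averaged over all model parameters. Because $\mathrm{E}_D(g_{D,j}^2) = \tilde{g}_j^2 + \rho_j^2 / n$ and we assume $n$ is large, $\tilde{g}_j^2 \approx g_{D,j}^2$. In the case of only one large training dataset, we estimate the GSNR at the $t$-th iteration by
$$r_{j,t} \approx \frac{g_{D,j,t}^2}{\rho_{D,j,t}^2},$$
where $\rho_{D,j,t}^2$ is the sample-wise gradient variance of $\theta_j$ computed on $D$.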
As shown in Figure 4, the GSNR starts out low with randomly initialized parameters. As learning progresses, the GSNR increases in the early training stage and stays at a high level for the rest of the learning process. For each model parameter, we also computed the proportion of samples with the same gradient sign. In Figure 4c, we plot the mean of the time series of this proportion over all the parameters. This value increases from about 50% (half positive, half negative due to random initialization) to about 56% at the end, which indicates that for most parameters, the gradient signs on different samples become more consistent. This is because meaningful features begin to emerge in the learning process and the gradients of the weights on these features tend to have the same sign among different samples.
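The sign-consistency statistic can be computed from the same per-sample gradient matrix used for the GSNR. The text does not spell out the exact definition, so the sketch below assumes that "the proportion of samples with the same gradient sign" means the fraction of samples sharing the majority sign for each parameter; `sign_agreement` is a hypothetical name.

```python
import numpy as np

def sign_agreement(per_sample_grads):
    """Fraction of samples sharing the majority gradient sign, per parameter.

    per_sample_grads has shape (num_samples, num_params).
    """
    g = np.asarray(per_sample_grads, dtype=float)
    frac_pos = (g > 0).mean(axis=0)
    frac_neg = (g < 0).mean(axis=0)
    return np.maximum(frac_pos, frac_neg)   # assumed reading: majority sign

grads = np.array([[1.0, 1.0],
                  [2.0, -1.0],
                  [3.0, 1.0],
                  [4.0, -1.0]])
agreement = sign_agreement(grads)
```

Under this reading a value of 0.5 corresponds to gradients split evenly between the two signs, and a value of 1.0 to all samples agreeing.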
Previous research (Zhang et al., 2016) showed that DNNs achieve zero training loss by memorizing training samples even if the labels are randomized. We also plot the average GSNR for a model trained using data with randomized labels in Figure 4 and find that the GSNR stays at a low level throughout the training process. Although the training loss goes to zero for both the original and the randomized labels (not shown), the GSNR curves clearly distinguish between these two cases and reveal the lack of meaningful patterns in the latter. We believe this is the reason why DNNs trained on real and random data exhibit completely different generalization behaviors.
3.2 Training Dynamics behind the GSNR behavior
In this section we show that the feature learning ability of DNNs is the key reason why the GSNR curve behavior of DNNs is different from that of shallow models during gradient descent training. To demonstrate this, a simple two-layer perceptron regression model is constructed. A synthetic dataset is generated as follows. Each data point is constructed i.i.d. using , where and are drawn from uniform distribution and is drawn from uniform distribution . The training set and test set sizes are 200 and 10,000, respectively. We use a very simple two-layer MLP structure with 2 inputs, 20 hidden neurons and 1 output.
We randomly initialized the model parameters and trained the model on the synthetic training dataset. As a control setup we also froze the model weights in the first layer to prevent it from learning features. Note that a two-layer MLP with the first layer frozen is equivalent to a linear regression model. That is, regression weights are learned on the second layer using fixed features extracted by the first layer. We plot the average GSNR of the second layer parameters for both the frozen and non-frozen cases. Figure 5 shows that in the non-frozen case, the average GSNR over parameters of the second layer rises significantly, whereas in the frozen case the average GSNR decreases in the beginning and remains at a low level during the whole training process.
In the non-frozen case, the GSNR curves of individual parameters of the second layer are shown in Figure 5. The GSNR of some parameters rises significantly. To measure the quality of the corresponding features, we computed the Pearson correlation between them and the target output, both at the beginning of training and at the maximum point of their GSNR curves. We can see that the learning process learns “good” features (those with a stronger correlation with the target output) from randomly initialized ones, as shown in Table 1. This shows that the GSNR increasing process is related to feature learning.
3.3 Analysis of training dynamics behind DNNs’ GSNR behavior
In this section, we will investigate the training dynamics behind the GSNR curve behavior. In the case of fully connected network structure, we can analytically show that the numerator of GSNR, i.e. the squared gradient mean of model parameters, tends to increase in the early training stage through feature learning.
Consider a fully connected network, whose parameters are , where are the weight matrix and bias of the first layer, and so on. We denote the activations of the -th layer as , where is the index for nodes/channels of this layer, and is the collection of model parameters in the layers before , i.e. . In the forward pass on data sample , is multiplied by the weight matrix :
where is the output of the matrix multiplication, for the -th data sample, on the -th layer, is the index of nodes/channels in the -th layer. We use to denote the average gradient of weights of the -th layer , i.e. , where is the loss of the -th sample.
Here we show that the feature learning ability of DNNs plays a crucial role in the GSNR increasing process. More precisely, we show that the learning of features , i.e. the learning of parameters tends to increase the absolute value of . Consider the one-step change of gradient mean with the learning rate . In one training step, is updated by . Using linear approximation with , we have
where and denote model parameters before and after the -th layer (including the -th), respectively.
The detailed derivation of eq. (28) can be found in Appendix 34. We can see the first term (which is a summation over parameters in ) in eq. (28) has opposite sign with . This term will make negatively correlated with . We plot the correlation between with for a model trained on MNIST for 200 epochs in Figure 6a. In the early training stage, they are indeed negatively correlated. For top-10% weights with larger absolute values, the negative correlation is even more significant.
Here we show that this negative correlation between and tends to increase the absolute value of through an interesting mechanism. Consider the weights with . Learning would decrease and thus increase its absolute value because the first term in eq. (28) is negative. On the other hand, learning would increase and its absolute value because is positive. This forms a positive feedback process, in which the numerator of GSNR, , would increase and so would the GSNR. A similar analysis can be done for the case with .
On the other hand, when , we show that the weights tend to change into the earlier case, i.e. during training. Consider the case of : the first term in eq. (28) is negative, so learning tends to decrease or even change its sign. Another possibility is that learning changes the sign of because is negative. In both cases the weights change into the earlier case with . A similar analysis can be done for the case of .
Therefore is a more stable state in the training process. For a simple model trained on MNIST, we plot the proportion of weights satisfying in Figure 6b and find that there are indeed more weights with than the opposite. Because weights with small absolute value easily change sign during training, we also plot this proportion for the top-10% weights with larger absolute values. We can see that for the weights with large absolute values, nearly 80% of them have opposite signs with their gradient mean, confirming our earlier analysis. For these weights, the numerator of GSNR, , tends to increase through the positive feedback process as discussed above.
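The statistic discussed here, the share of weights whose sign is opposite to that of their gradient mean, optionally restricted to the weights with the largest absolute values, can be sketched as follows. The function name, the `top_fraction` argument, and the toy arrays are hypothetical; real use would pass the flattened weights of a layer and their average gradients.

```python
import numpy as np

def opposite_sign_fraction(weights, grad_mean, top_fraction=None):
    """Share of weights whose sign is opposite to their mean gradient.

    If top_fraction is given (e.g. 0.1), only the weights with the largest
    absolute values are considered.
    """
    w = np.ravel(np.asarray(weights, dtype=float))
    g = np.ravel(np.asarray(grad_mean, dtype=float))
    if top_fraction is not None:
        k = max(1, int(round(top_fraction * w.size)))
        idx = np.argsort(-np.abs(w))[:k]     # indices of the k largest |w|
        w, g = w[idx], g[idx]
    return float(np.mean(w * g < 0))

# Toy values: the two largest weights both oppose their gradient mean.
w = np.array([5.0, -4.0, 0.3, -0.2, 0.1, 0.1, -0.1, 0.2, 0.1, -0.3])
g = np.array([-1.0, 1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0])
frac_all = opposite_sign_fraction(w, g)
frac_top = opposite_sign_fraction(w, g, top_fraction=0.2)
```

Restricting to the largest weights filters out small weights that flip sign easily, which is the motivation given in the text for reporting the top-10% separately.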
In this paper, we performed a series of analyses on the role of model parameters’ GSNR in deep neural networks’ generalization ability. We showed that large GSNR is a key to a small generalization gap, and that gradient descent training naturally induces and exploits large GSNR as the model discovers useful features during learning.
References
- Advani and Saxe (2017). High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667.
- Arora et al. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.
- Bartlett and Mendelson (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482.
- Dinh et al. (2017). Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1019–1028.
- Dziugaite and Roy (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008.
- Graves (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850v5.
- Hardt et al. (2015). Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240.
- Hoffer et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems, pp. 1731–1741.
- Jakubovitz et al. (2019). Generalization error in deep learning. Springer.
- Kawaguchi et al. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.
- Keskar et al. (2016). On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Neyshabur et al. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956.
- Novak et al. (2018). Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760.
- Rainforth et al. (2018). Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537.
- Vapnik and Chervonenkis (1991). The necessary and sufficient conditions for consistency of the method of empirical risk. Pattern Recognition and Image Analysis 1 (3), pp. 284–305.
- Zeiler (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zhang et al. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
Appendix A
A.1 Model Structure in Section 2.4
As shown in Table 2, all models in the experiment consist of 2 Conv-Relu-MaxPooling blocks and 2 fully-connected layers, but they are different in the number of channels. We choose the number of channels from .
|Layer||input #channels||output #channels|
|conv + relu + maxpooling||1|
|conv + relu + maxpooling|
|fc + relu||16 *||10 *|
|fc + relu||10 *||10|
A.2 Experiment on CIFAR10
Different from the experiment on MNIST, we use a deeper network on CIFAR10. We also include the Batch Normalization (BN) layer, because we find that it is difficult for the network to converge in its absence. The network consists of 4 Conv-BN-Relu-Conv-BN-Relu-MaxPooling blocks and 3 fully-connected layers. More details are shown in Table 3.
|Layer||input #channels||output #channels|
|conv + bn + relu||3|
|conv + bn + relu|
|conv + bn + relu|
|conv + bn + relu|
|conv + bn + relu|
|conv + bn + relu|
|conv + bn + relu|
|conv + bn + relu|
|fc + relu||32 *||8 *|
|fc + relu||8 *||8 *|
The experiment is conducted under a similar setting as that of MNIST in section 2.4. We choose , , . We use gradient descent training (not SGD), with a small learning rate of . The left and right hand sides of eq. (19) at different epochs are shown in Figure 7, where each point represents one specific combination of the above settings. Note that at the evaluation step of every epoch, we use the same mean and variance inside the BN layers as for the training dataset. This ensures that the network and loss function are consistent between training and test.
At the beginning of training, unlike on MNIST, the data points no longer reside perfectly on the diagonal dashed line. We suppose that this is because of the presence of the BN layers, whose internal parameters, i.e. running mean and running variance, are not regular learnable parameters in the optimization process, but change their values in a different way. Their change affects the OSGR, yet we could not include them in the estimation of OSGR. However, the strong positive correlation between the left and right hand sides of eq. (19) can always be observed until the training begins to converge.
A.3 Experiment on Toy Dataset
In this section we show a simple two-layer regression model consisting of a FC-Relu structure with only 2 inputs, 1 hidden layer with neurons and 1 output. A synthetic dataset similar to the training data used in the experiment of Section 3.2 is generated as follows. Each data point is constructed i.i.d. using , where and are drawn from uniform distribution of and is drawn from uniform distribution of .
To estimate eq. (24), we randomly generate 100 training sets with samples each, i.e. =100, and a test set with 20,000 samples. To cover different conditions, we (1) choose ; (2) inject noise with ; (3) perturb model structures by choosing . We use gradient descent with learning rate of 0.001.
Figure 8 shows behavior similar to Figure 3. During the early training stages, the LHS and RHS of eq. (19) are very close. They remain highly correlated until training converges, whereas the RHS of eq. (19) decreases significantly.
Appendix B
Appendix C
|A data distribution satisfies|
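|$(x, y)$ or $(x_i, y_i)$||A single data sample|
|$D$||Training set consists of $n$ samples drawn from $\mathcal{Z}$|
|$D'$||Test set consists of $n'$ samples drawn from $\mathcal{Z}$|
|$\theta$||Model parameters, whose components are denoted as $\theta_j$|
|$g(x, y, \theta)$ or $g_i(\theta)$||Parameters’ gradient w.r.t. a single data sample $(x, y)$ or $(x_i, y_i)$|
|$\tilde{g}(\theta)$||Mean values of parameters’ gradient over a total data distribution, i.e., $\mathrm{E}_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$|
|$g_D(\theta)$||Average gradient over the training dataset, i.e., $\frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i, \theta)$|
|$g_{D'}(\theta)$||Average gradient over the test dataset, i.e., $\frac{1}{n'}\sum_{i=1}^{n'} g(x'_i, y'_i, \theta)$. Note that, in eq. (5), we assume $n' = n$|
|$\rho^2(\theta)$||Variance of parameters’ gradient of a single sample, i.e., $\mathrm{Var}_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$|
|$\sigma^2(\theta)$||Variance of the average gradient over a training dataset of size $n$, i.e., $\mathrm{Var}_{D\sim\mathcal{Z}^n}[g_D(\theta)]$|
|$r(\theta_j)$ or $r_j$||Gradient signal to noise ratio (GSNR) of model parameter $\theta_j$|
|$L[D]$||Empirical training loss, i.e., $\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta))$|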