Stochastic gradient descent (SGD) and its variants have been surprisingly successful for training complex deep nets [80, 67, 52]. The surprise comes from two aspects: first, SGD is able to ‘solve’ such non-convex optimization problems, and second, the solutions typically have good generalization performance. While numerous commercial, scientific, and societal applications of deep nets are being developed every day [71, 48, 18]
, understanding the optimization and generalization aspects of SGD for deep nets has emerged as a key research focus for the core machine learning community.
In this paper, we present new empirical and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets. For the empirical study, we view each run of SGD as a realization of a stochastic process in line with recent perspectives and advances [11, 44, 76, 29]
. We repeat each experiment, i.e., training a deep net on a data set from initialization to convergence, 10,000 times to get a full distributional characterization of the stochastic process, including the dynamics of the value, gradient, and Hessian of the loss.Thus, rather than presenting an average over 10 or 100 runs, we often show trajectories of different quantiles of the loss and associated quantities, giving a more complete sense of the empirical behavior of SGD.
We consider three key research questions in the paper. First, how does the Hessian of the loss relate to the second moment of the stochastic gradients
(SGs)? In general, since the loss is non-convex, the Hessian will be indefinite with both positive and negative eigenvalues and the second moment of SGs, by definition, will be positive (semi-)definite (PSD). The subspace corresponding to the top (positive) eigenvalues of the second moment broadly captures the preferred direction of the SGs. Does this primary subspace of the second moment overlap substantially with the subspace corresponding to the top positive eigenvalues of the loss Hessian? We study the dynamics of the relationship between these subspaces till convergence for a variety of problems (Section4). Interestingly, the top sub-spaces do indeed align quite persistently over iterations and for different problems. Thus, within the scope of these experiments, SGD seems to be picking up and using second order information of the loss.
Second, how does the empirical dynamics of SGD look like as a stochastic process and how do we characterize such dynamics and convergence in theory? We study the empirical dynamics of the loss, cosine of subsequent SGs, and norm of the SGs based on 10,000 runs to get a better understanding of the stochastic process [41, 81, 17, 3, 62, 16, 79, 8, 41]
. Under simple assumptions, we then present a distributional characterization of the loss dynamics as well as large deviation bounds for the change in loss at each step using a characterization based on a suitable martingale difference sequence. Special cases of the analysis for over-parameterized least squares and logistic regression provide helpful insights into the stochastic dynamics. We then illustrate that adaptive step sizes or adaptive diagonal preconditioning can be used to convert the dynamics into a super-martingale. Under suitable assumptions, we characterize such super-martingale convergence as well as rates of convergence of such adaptive step size or preconditioned SGD to a stationary point.
Third, can we develop a scale-invariant generalization bound which considers the structure of the Hessian at minima obtained from SGD
? While the Hessian at minima contains helpful information such as local flatness, the Hessian changes if the deep net is reparameterized. We develop a PAC Bayesian bound based on an anisotropic Gaussian posterior whose precision (inverse covariance) matrix is based on a suitably thresholded version of the Hessian. The posterior is intuitively meaningful, e.g., flat directions in the Hessian have large variance in the posterior. We show that while Hessian itself changes due to re-parameterization, the KL-divergence between the posterior and prior does not, yielding a scale invariant generalization bound. The bound revels a dependency and also trade-off between two scale-invariant terms: a measure of effective curvature and a weighted Frobenius norm of the change in parameters from initialization. Both terms remain unchanged even if the deep net is re-parameterized. We also show empirical results illustrating that both terms stay small for simple problems and they increase for hard learning problems.
Our experiments explore the fully connected feed-forward network with Relu activations. We evaluate SGD dynamics on both synthetic datasets and some commonly used real datasets, viz., the MNIST database of handwritten digits and the CIFAR-10 dataset . The synthetic datasets, which are inspired by recent work in [64, 80], consist of equal number of samples drawn from isotropic Gaussians with different means, each corresponding to one class. We refer to these datasets as Gauss-. We also consider variants of these datasets where a fixed fraction of points have been assigned random labels  without changing the features. Details of our experimental setup are discussed in the Appendix. Experiments on both Gauss-10 and Gauss-2 datasets are repeated 10,000 times for each setting to get a full distributional characterization of the loss stochastic process and associated quantities, including full eigen-spectrum of the Hessian of the loss and the second moment. In the main paper, we present results based on Gauss-10 and some on MNIST and CIFAR-10. Additional results on all datasets and all proofs are discussed in the Appendix.
The rest of the paper is organized as follows. We start with a brief discussion on related work in Section 2. In Section 3, we introduce the notations and preliminaries for the paper. In Section 4, we investigate the relationship between the Hessian of the loss and the second moment of the SGs. In Section 5, we characterize the dynamics of SGD both empirically and in theory by treating SGD as a stochastic process, and reveal the influence of the dynamics of Hessian and second moment on such dynamics. In Section 6, we present a PAC-Bayesian based scale-invariant generalization bound which balances the effect of the local structure of the Hessian and the change in parameters from initialization. Finally, we conclude the paper in section 7.
2 Related Work
Hessian Decomposition. The relationship between the stochastic gradients and the Hessian of the loss has been studied by decomposing the Hessian into the covariance of stochastic gradients and the averaged Hessian of predictions [76, 64, 63]. [64, 63]
found the eigen-spectrum of the Hessian after convergence are composed of a ‘bulk’ which corresponds to large portion of zero and small eigenvalues and ‘outliers’ which corresponds to large eigenvalues. Later on,[53, 54] found that the ‘outliers’ of the Hessian spectrum come from the covariance of the stochastic gradients and the ‘bulk’ comes from the averaged Hessian of predictions. [23, 20] studied the dynamics of the gradients and the Hessian during training and found that large portion of the gradients lie in the top-eigenspace of the Hessian. 
found that using batch normalization suppresses the outliers of Hessian spectrum and the stochastic covariance spectrum.
SGD dynamics and convergence. To understand how SGD finds minima which generalizes well, various aspects of SGD dynamics have been studied recently. [76, 29] studied the SGD trajectory and deduced that SGD bounces off walls of a valley-like-structure where the learning rate controls the height at which SGD bounces above the valley floor while batch size controls gradient stochasticity which facilitates exploration. [11, 28, 44]
studied the stationary distribution of SGD using characterizations based stochastic partial differential equations. In particular, proposed that constant learning rate SGD approximates a continuous time stochastic process (Ornstein-Uhlenbeck process) with a Gaussian stationary distribution. However, the assumption of constant covariance has been considered unsuitable for deep nets [76, 64]. There have been work studying SGD convergence and local minima. [62, 16]
proved the probability of hitting a local minima in Relu neural network is quite high, and increases with the network width. The convergence of SGD for deep nets has been extensively studied and it has been observed that over-parameterization and proper random initialization can help the optimization in training neural networks[15, 3, 1, 41, 42]. With over-parameterization and random initialization, GD and SGD can find the global optimum in polynomial time for deep nets with Relu activation [41, 81]15]. Linear rate of SGD for optimizing over-parameterized deep nets is observed in some special cases and assumptions [17, 1, 3]. However,  showed that for linear deep nets, the number of iterations required for convergence scales exponentially with the depth of the network, which is opposite to the idea that increasing depth can speed up optimization .
Generalization. Traditional approaches attribute small generalization error either to properties of the model family or to the regularization techniques. However, these methods fail to explain why large neural networks generalize well in practice . Recently, several interpretations have been proposed [52, 67]. The concept of generalization via achieving flat minima was first proposed in . Motivated by such idea,  proposed the Entropy-SGD algorithm which biases the parameters to wide valleys to guarantee generalization.  showed that small batch size can help SGD converge to flat minima. However, for deep nets with positively homogeneous activations, most measures of sharpness/flatness and norm are not invariant to rescaling of the network parameters, corresponding to the same function (“-scale transformation” ). This means that the measure of flatness/sharpness can be arbitrarily changed through rescaling without changing the generalization performance, rendering the notion of ‘flatness’ meaningless. To handle the sensitive to reparameterization,  explained the generalization behavior through ‘Bayesian evidence’, which penalizes sharp minima but is invariant to model reparameterization. In addition, some spectrally-normalized margin generalization bounds have proposed which depend on the product of the spectral norms of layers [5, 51].  proposed deterministic margin bound based on suitable derandomization of PAC-Bayesian bounds in order to address the exponential dependence on the depth [5, 51]. Most recently, [57, 69, 77] explored scale-invariant flatness measure for deep network minima.
In this section, we set up notations and discuss preliminaries which will be used in the sequel. For simplicity, we denote the
-th entry of a vectoras . Let be a fixed but unknown distribution over a sample space and let
be a finite training set drawn independently from the true distribution . For -class classification problems, we have
where is a data sample, is the corresponding label, and
is the set of labels. In this paper, we focus on empirical and theoretical analysis of SGD applied to deep nets such as feed forward neural networks, and convolutaional networks[37, 22]. With denoting the parameters of the deep net, the empirical loss on the training set is
for a suitable point-wise loss . The gradient of the empirical loss is
and the covariance of the sample gradient is
be the Hessian of the empirical loss and
be the second moment of sample gradient . Note that these quantities are all defined in terms of the training sample .
where is the step size and is the stochastic gradient (SG) typically computed from a mini-batch of samples. Let be the batch size, so that . We denote the mean and covariance of SG as and respectively, and we have and . For convenience, we introduce the following notation:
Let be the output of the deep net for a -class classification problem, then the prediction probability of true label is given by:
where is the -th entry of . In this paper, we consider the log-loss, also known as the cross-entropy loss, given by
4 Hessian of the Loss and Second Moment of SGD
In this section, we investigate the relationship between the Hessian of the training loss and the second moment of SGs
. We compute and compare the eigenvalues and eigenvectors of bothand . Our experimental results show that the primary subspaces of and overlap while the eigenvalue distributions (eigen-spectra) of the two matrices have substantial differences. We also illustrate that the overlap of the primary subspaces cannot be quite explained based the Fisher information matrix. Different from [23, 20], our work not only focuses on the full eigenvalue distribution at the beginning and the end of SGD, but also reveals how such distribution evolves during the training by providing additional result at intermediate iteration. Comparing with , we use a more well-established metric to characterize the overlap between two subspaces.
(right) for Gauss-10. Solid lines represent the mean value over all runs, and shaded bands indicate the 95% confidence intervals.to , indicating the top 10 principal subspaces are well aligned.
4.1 Hessian Decomposition
For and as defined in (6) and , we have
Figure 1 shows the full eigen-spectrum (averaged over 10,000 runs) of , , and the residual term at the first, one intermediate, and the last iteration of SGD trained on Gauss-10. Overall, the eigenvalue dynamics of and exhibit similar trend (Figure 1, 2: middle and right) with slight increase at early iterations, followed by a decrease. Eigenvalues of usually drop faster than those of (Figure 2: middle and right). In general, when the training data has classes, is a positive definite matrix with an order of non-trivial eigenvalues .
For simple problems in which data are easy to separate, as SGD converges, we would expect the average gradient as well as the gradient for each individual sample to be close to 0 [15, 42]. Thus approaches 0 as almost all non-trivial eigenvalues vanish (Figure 1 (a): middle). As a result, the residual approaches (Figure 1 (a): left and right).
When dealing with a hard problem (Figure 1(b)), at convergence, even though the average gradient may be close to 0, certain individual gradients need not. Therefore, may still have a handful of non-trivial eigenvalues, but they are at least one order of magnitude smaller than the top eigenvalues of (Figure 1(b): left and middle). Such an observation is not only true for Guass-10 with 15% random labels, but also observed on MNIST (Figure 3: middle and right) and CIFAR-10 dataset (Figure 4: middle and right). Hence, the top eigenvalues of are much closer to than those of . The bottom eigenvalues of always approaches since only has non-negative eigenvalues. Overall, the eigen-spetrum of looks very similar to close to convergence. Our observations disagree with existing claims  which suggest that is almost equal to near the minima by assuming the residual term disappears.
4.2 Top Subspaces: Hessian and Second Moment
Proposition 1 indicates that, at any time during training, and are related but differ by a residual term . Our empirical results show that the impact of on the eigen-spectrum dynamics of and is persistent and not negligible. But how does impact the primary subspaces, i.e., corresponding to the largest positive eigenvalues, of and ?
To answer this question, we carefully assess the overlap between the principal eigen-spaces of and determined by the eigenvectors corresponding to the top (positive) eigenvalues of these matrices during training based on principal angles; additional results based on Davis-Kahan [13, 78] perturbation can be found in the appendix. Recall that principal angles  between two subspaces and in , whose dimensions satisfy , are defined recursively by
Let and be two orthogonal matrices, e.g., eigenvectors of and , whose range are and respectively, then we have , where
denotes the singular values of. Such relationships are also considered in Canonical Correlation Analysis (CCA) .
Figures 2-4 show the evolution of the subspace overlap between and in terms of the top-15 principal angles for Gauss-10, MNIST, and CIFAR-10 datasets. All datasets have 10 classes. The key observation is the top 10 principal subspaces of and quickly align with each other and overlap almost completely during the entire training period. Additional results for Gauss-2 dataset whose top 2 principal subspaces overlap can be found in the appendix (Figure LABEL:fig:principal_angle_k2). Such persistent overlap occurs in both synthetic and real datasets, suggesting that the second moment of the SGs somehow carry second order information about the loss.
Notice that in some scenarios, e.g., MNIST dataset (Figure 3 (b)), the cosine value of all 15 principal angles stays high. However, comparing with its top 10 eigenvalues, the remaining eigenvalues of have much smaller values (Figure 3 (b):right). Thus, the overlap in these subspaces will not significantly affect the behavior of SGD.
|Least Squares||Binary Logistic Regression|
4.3 Relationship with Fisher Information
The observation that the primary subspaces of indeed overlap with the primary suspaces is somewhat surprising. One potential explanation can be based on connecting the decomposition in Proposition 1 with the Fisher Information matrix.
In particular, using integration by parts and recalling that , we in fact have
which is line with Proposition 1. However, under the regularity conditions, we have (see Appendix B.3) , which makes the two forms of equal. However, for finite samples, the expectation is replaced by , so the term does not cancel out, and the finite sample in Proposition 1 does not become zero. In addition, during the SGD iterations are not the true parameter , so the quantities involved in Proposition 1 are not quite the finite sample versions of Fisher Information due to model misspecification.
) are not the same possibly due to the finite sample over-parameterized setting, inaccurate estimate of, and the non-smoothness of the Relu activation.
Especially when the model is over-parameterized, even for smooth loss functions, we may still observeto be different from , e.g., see the analysis on over-parameterized least squares (Example 1) and over-parameterized logistic regression (Example 2) below. Thus, Fisher information alone is not sufficient to explain the overlap between the primary subspaces of and .
Example 1 (Least Squares)
Let be the design matrix and be the response vector for training samples. Given a sample defined in (2), we assume the following linear relationship holds: , where is a Gaussian noise with mean and variance . The empirical loss of the least squares problem is given by
The Hessian of the empirical loss , the second moment , and the residual term can be directly calculated (see Appendix B.4.1) and has been summarized in Table 1 (second column). In high dimensional case when , the optimal solution satisfies . As approaches , we have
so that and .
Example 2 (Logistic Regression)
The empirical loss of binary logistic regression for observations and is given by
The Hessian of the empirical loss , the second moment , and the residual term can be calculated directly (see Appendix B.4.2) and has been summarized in Table 1 (last column). In the high dimensional case, the data are always linearly separable. Thus can be arbitrarily small, depending on . When approaches the optimum, we have , therefore
so that approaches and as .
4.4 Layer-wise Hessian
We also provide a layer-wise analysis of the curvature as obtained from the Hessian. The Hessian of a 2-hidden layer Relu network can be thought of as a collection of several block matrices (Figure 5). We use to denote the lowest layer which connects to the input and corresponds to the output layer. Let denote the diagonal blocks of , such that each element in is the partial derivative taken with respect to the weights in the same layer,
where is the output of the network defined in Section 3, is the weight at the layer connecting the node from the previous layer and node at the layer , and is the Hessian of the sub-network starting from the layer , with being the full Hessian , and being the same as . From (21), every matrix is positive semi-definite (PSD) since , the Hessian of the logistic loss, is PSD. The definitions of and has been depicted in Figure 5. Figure 6 shows the eigen-spectrum dynamics of and , for . We observe that the top eigenvalus of all are of the same order of magnitude. To get a better sense of which layer contributes more, we also analyze the eigenvectors corresponding to the top eigenvalues of . We evaluate the magnitude of the eigenvector components corresponding to each layer, normalized by the layer size, and the results can be found in Figure 7. Overall, for simple problem, layer 2 (connected to the output) always has the largest value while layer 0 contributes less. Such a relationship holds for hard problem at the beginning of the training, then the difference among the 3 layers shrinks, and eventually all layers have almost equal contributions.
5 SGD Dynamics
In this section, we study the empirical dynamics of the loss, cosine of the angle between subsequent SGs, and the norm of the SGs based on constant step-size SGD for fixed batch sizes and averaged over 10,000 runs. We then present a distributional characterization of the loss dynamics as well as a large deviation bound of the change in loss at each step. Further, we specialize the analysis for the special cases of least squares regression and logistic regression to gain insights for these cases. Finally, we present convergence results to a stationary point for mini-batch SGD with adaptive step sizes as well as adaptive preconditioning.
5.1 Empirical Loss Dynamics
The stochastic parameter dynamics of as in (8) and associated quantities such as the loss can be interpreted as a stochastic process. Since we have 10,000 realizations of the stochastic process, i.e., parameter and loss trajectory based on SGD, we present the results at different quantiles of the loss at each iteration .
SGD dynamics. Figure 8 shows the dynamics of the loss , the angle between two consecutive SGs , and the norm of the SGs for problems with different levels of difficulty, i.e., percentage of random labels in Gauss-10. For the simple problem with true labels (0% random) (Figure 8(a)), SGD converges fast in less than 50 iterations. The three quantities , and of all quantiles exhibit similar behavior, and decrease rapidly in the early iterations. Both and converge to zero, while reaches a steady state and oscillates around 0.25, indicating subsequent SGs are almost orthogonal to each other (the angle between subsequent SGs is greater than ).
The dynamics become more interesting with 15% random labels, the more difficult problem. In the initial phase, all quantiles of the loss and the angle between subsequent SGs drop sharply, similar to the 0% random label case. On the other hand, the gradient norms of all quantiles increase despite the slight drop for the and quantiles at the very beginning (see Figure 18 in the appendix for more details). Once the gradient norm peaks, and hits a valley, SGD enters a convergence phase. At this late phase, the gradient norm shows a steady decrease, while grows again until it reaches a steady state and begins to oscillate around 0. The loss persistently reduces to 0, but the rate of change also declines. We also observe similar dynamics in both MNIST and CIFAR-10 (see Figures 21 and 20 in the Appendix).
Empirical loss dynamics. Let denote the stochastic loss difference, i.e.,
Figure 9 shows the empirical distributions of at the and quantiles of the loss at iteration respectively. In particular, we get the empirical distribution of the quantile for every iteration from 10,000 runs of SGD. Overall, the distributions of are roughly symmetric, and mainly contain two stages of change: in the first stage, the means of both the upper () and lower () quantile distributions move away from the red horizontal line where (Figure 22 (a): iteration 1 to 15, and (b): iteration 1 to 100) while the variance of grows for all quantiles (Figure 10 (a) and (b)). In the subsequent stage, the mean of both the upper () and lower () quantiles moves towards zero (Figure 22 (a): iteration 25 to 200, and (b): iteration 200 to 9999), while the variance of all quantiles shrinks significantly. As SGD converges, the mean of at all quantiles stays near zero, and the variance becomes very small.
5.2 Deviation Bounds for Loss Dynamics
We consider the following two types of SGD updates:
to be referred to as vanilla SGD in the sequel, and
to be referred to as preconditioned SGD with diagonal preconditioner matrix . Recall that here represents the SG of at iteration computed based on a mini-batch of samples. Notice that if we take to be , preconditioned SGD becomes vanilla SGD.
Assuming SGs follow a multi-variate distribution with mean and covariance , we can represent the SG as
is reasonable since in a high dimensional space, sampling from an isotropic Gaussian distribution is effectively the same as sampling from (a small annular ring around) the sphere with high probability. Let us denote the batch dependent second moment of the SG as
We show in theory that for vanilla SGD, our observations of the two-phase dynamics of conditioned on , i.e., the inter-quantile range and variance increasing first then decreasing, in Figure 9 can be characterized by the Hessian , the covariance and associated quantities introduced in Section 4. To proceed with our analysis, we make Assumption 1, and then present Theorem 1 characterizing a conditional expectation and large deviation bound for defined in (22).
Assumption 1 (Bounded Hessian)
Let , where is the step size in (8), and is a ball of radius in . There is an such that for all .
Assumption 1 is the so called local smoothness condition . From Figure 2, 3, and 4, the largest eigenvalues of will decrease after the first few iterations. Therefore it is reasonable to assume the spectral norm of to be bounded when is close to a point in SGD iterations.
Further, for all , we have
and is an absolute constant.
The proof of Theorem 1 can be found in Appendix D. At iteration , Theorem 1 tells us that the conditional distribution of stays in the interval with high probability, where the concentration depends on dynamic quantities , and related to SG covariance and expectation.
The interval depends on two key quantities: (1) the negative of the 2-norm of the full gradient (first moment of the SG), and (2) the trace of the second moment of the SG . The first term tends to push the mean downward, while the second term lifts the mean. When is less than zero and and are small, SGD will decrease the loss with high probability.
For constant step size , the dynamics of corresponds to the dynamics of and the dynamics of corresponds to the dynamics of . The eigenvalues of first increase then decrease (Figure 1: middle), and so does and . Therefore, the dynamics of the variance of follow a similar trend (Figure 10).
SGD is able to escape certain types of stationary point or local minima. Consider a scenario where reaches a stationary point of such that , but is not the local minima of all , i.e., , such that . Then we have but , and the deviation bound becomes
so that concentrates around in the current setting since is positive definite. Therefore will increase and escape such stationary point or local minima.
We give a detailed characterization of for two over-parameterized problems using Theorem 1: (a) high dimensional least squares and (b) high dimensional logistic regression. For both problems, SGD has two stages of change as discussed in Section 5.1.
5.2.1 High Dimensional Least Squares
Considering the least squares problem in Example 1, we have the following result:
Consider high dimensional least squares as in Example 1. Let us assume , , and , where is the minimum singular value. Choosing , for we have
where , and is a positive constant.
Note that Corollary 1 presents a one-sided version of the concentration, but focuses on the side of interest, which characterizes the lower side or decrease in . From Corollary 1, SGD for high dimensional least squares has two phases. Early on in the iterations, will be much smaller than zero, thus SGD can sharply decrease . are large since is not close to and has large eigenvalues. Therefore the probability density of will spread out over its range, and facilitates exploration. In later iterations, both and are small, therefore SGD will help with the loss approaching the global minima with a sharp concentration.
5.2.2 High Dimensional Logistic Regression
For the binary logistic regression problem in Example 2, we have:
Consider high dimensional logistic regression given by (19). Let us assume , , and , where is the minimum singular value. If we choose , we have