Finding flat minima is important in many optimization problems, especially those arising in machine learning applications. Models at flat minima generalize better than those at sharp minima because the value of the objective function remains similar if the data is shifted, distorted, or otherwise changed. Thus, in practice, machine learning models at flatter optima should perform better than those at sharper ones on test data that is distributed differently from the original training data.
Here, we define flat minima as those with a small spectral radius (largest absolute eigenvalue) of the Hessian of the objective function and sharp minima as those where the spectral radius is large. For minima where this spectral radius is small, there is no direction away from this point in which the objective function immediately and rapidly increases. Therefore, by regularizing our optimization of models with respect to this spectral radius, we are able to obtain solutions that are less susceptible to errors or biases in the training or test data.
However, this regularization presents certain challenges. For large neural networks, computing and storing the Hessian and the third-derivative tensor are infeasible; therefore, we need methods to efficiently compute the regularization term and its gradient. We also need methods that introduce errors and biases into the data in order to test the generalizability of these models.
To tackle these challenges, we build a method to regularize the spectral radius while computing Hessian-vector products. We present results with different regularization parameters to show our methodology is stable. We introduce multiple methods to test generalizability, tailored to specific problems. Our contributions include:
- We develop an algorithm for regularizing neural networks with respect to the spectral radius of the Hessian, a novel use of a derivative measure for such regularization.
- We apply and extend differential operators used to efficiently compute Hessian-vector products for neural networks.
- We provide formal proofs of convergence and other properties of our algorithm.
- We present experimental results on multiple real-world data sets across different domains, designing specific methods to test generalizability.
In Section 2, we review existing literature related to our research. In Section 3, we derive the algorithm used for our regularization. In Section 4, we discuss convergence results and other properties of the algorithm. In Section 5, we describe different generalizability tests and present results of our experiments with regularization on various data sets.
2 Related Work
Existing work has examined how different learning methods affect the ability of neural networks to converge to flat minima. Keskar et al. observed that large-batch stochastic gradient descent (SGD) and its variants, such as adaptive moment estimation (Adam), tend to converge to sharp minima, while small-batch methods converge to flat minima. This implies that small-batch methods generalize better than large-batch methods, as the training function at sharp minima is more sensitive. Possible causes include that large-batch methods over-fit, are attracted to saddle points, or lack the explorative properties of small-batch methods (that is, they tend to converge to the minima close to the initial weights). Yao et al. showed that large-batch training of neural networks converges to points with a larger Hessian spectrum (in terms of both the dominant and the other eigenvalues), and that such points show poor robustness. Jastrzebski et al. extended these claims by showing that a large learning rate also leads to flatter minima that generalize better than sharper minima.
Others have suggested different ways to measure and find flat minima, including different objective functions and optimization algorithms. Ma et al. suggest that Kronecker-Factored Approximate Curvature (K-FAC) [Martens and Grosse, 2015], an approximate second-order method, may yield generalization improvements over first-order SGD. Chaudhari et al. proposed an entropy-based objective function to find solutions in flat regions and an algorithm (called Entropy-SGD) to conduct the optimization. He et al. observed that at local minima of deep networks, there exist many asymmetric directions where the loss sharply increases, which they call “asymmetric valleys.” They propose weight averaging along the SGD trajectory to bias solutions in favor of the flat side. Chaudhari et al. also noted that many neural networks, trained on various data sets using SGD or Adam, converge to a point with a large number of near-zero eigenvalues, along with a long positive tail and a shorter negative tail. These observations about the imbalanced eigenspectrum by Chaudhari et al. and He et al. are of special interest to us, as our regularization method, which attempts to reduce the spectral radius of the Hessian, should be tailored to avoid these asymmetric valleys.
While Yoshida and Miyato developed a spectral norm regularization method, it looks solely at the spectral radius of a neural network’s weight matrices, rather than the spectral radius of the Hessian of the loss function. While they experimentally show that their regularization method has a small generalization gap (between the training and test set), their method also had a higher Hessian spectral radius than vanilla SGD, weight-decay, and adversarial methods. We believe our regularization method and generalization tests more directly address the task of finding flat minima and evaluating their generalizability.
|weights of neural network|
|spectral radius of|
|degree of regularization|
|eigenvector corresponding to spectral radius|
For reference, we provide a summary of the main variables used and their corresponding definitions in Table 1. We choose to express our problem as a regularized optimization problem, rather than a constrained optimization or min-max problem. Thus, our optimization problem is:
where weights , is a non-convex objective function, is the spectral radius of the Hessian of (i.e., the maximal absolute eigenvalue), and are regularization parameters. For convenience, we denote .
The caveat is: we cannot directly compute . For large neural networks of size , computing and storing objects of size (such as the Hessian) is infeasible. However, we can efficiently compute the Hessian-vector product for a given using methods discussed in Section 3.1.1.
Our goal is to design efficient algorithms for solving this minimization problem. In Section 3.1, we discuss how to compute the regularized term and its gradient. In Section 3.2, we present and explain different variants of our algorithm.
3.1 Gradients of Regularization Term
The spectral radius can be expressed , where is the eigenvector corresponding with the maximum absolute eigenvalue. In order to compute gradient update steps, we need to calculate .
For distinct eigenvalues of symmetric matrix ,
where is the eigenvector corresponding to eigenvalue .
van der Aa et al. prove Lemma 3.1. The expression for this derivative is more complicated with repeated eigenvalues, so we assume that the eigenvalue in question is distinct (in practice, this is almost surely the case).
Using this result and assumption, we express . Thus, if we can efficiently compute and for , we can respectively calculate and .
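The eigenvalue-derivative identity behind Lemma 3.1 can be sanity-checked numerically. The sketch below assumes the standard form of the result for a symmetric matrix A(t) with a distinct eigenvalue λ and unit eigenvector v, namely dλ/dt = vᵀ(dA/dt)v. The 2×2 matrix family and all function names are illustrative choices, not taken from the paper.

```python
import math

def top_eigpair(a, b, d):
    """Largest eigenvalue and unit eigenvector of [[a, b], [b, d]] (requires b != 0)."""
    lam = 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)
    # (b, lam - a) solves (A - lam * I) v = 0 when b != 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

def matrix(t):
    """Hypothetical smooth family A(t) = [[2 + t, 1 + t^2], [1 + t^2, 3 - t]]."""
    return 2.0 + t, 1.0 + t * t, 3.0 - t

def matrix_derivative(t):
    """Entrywise derivative dA/dt of the family above."""
    return 1.0, 2.0 * t, -1.0

def analytic_slope(t):
    """Eigenvalue derivative from the identity v^T (dA/dt) v."""
    a, b, d = matrix(t)
    _, (vx, vy) = top_eigpair(a, b, d)
    da, db, dd = matrix_derivative(t)
    return da * vx * vx + 2.0 * db * vx * vy + dd * vy * vy

def numeric_slope(t, eps=1e-6):
    """Central finite difference of the largest eigenvalue, used as a reference."""
    lam_plus, _ = top_eigpair(*matrix(t + eps))
    lam_minus, _ = top_eigpair(*matrix(t - eps))
    return (lam_plus - lam_minus) / (2.0 * eps)
```

Comparing `analytic_slope(0.5)` with `numeric_slope(0.5)` shows the two agree to within finite-difference error, which is the property the gradient computation in this section relies on.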
3.1.1 Hessian-Vector Operations
In order to compute and for large neural networks with , we extend the operator of Pearlmutter [1994]. We define this differential operator:
Note that . Thus, by applying the differential operator to the forward and backwards passes used to calculate the gradient, we can compute efficiently.
We extend this to by computing and during the forward pass and , , and during the backward pass. Our formulas and the derivation can be found in Appendix A. Since , this allows us to efficiently compute .
These methods keep the number of stored values at $O(n)$, where $n$ is the number of weights, while directly computing the Hessian and third-derivative tensor would require $O(n^2)$ and $O(n^3)$ storage, respectively (which is infeasible for large networks).
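To illustrate why a Hessian-vector product never requires the full Hessian, here is a minimal sketch. It does not implement the exact differential operator used in the paper; as a stand-in, it approximates the product with a central finite difference of the gradient, which needs only two gradient evaluations and storage linear in the number of weights. The toy objective and the function names are hypothetical.

```python
def grad_f(w):
    """Gradient of the toy objective f(w) = w0^2 * w1 + 3 * w1^2 (illustrative only)."""
    w0, w1 = w
    return [2.0 * w0 * w1, w0 * w0 + 6.0 * w1]

def hessian_vector_product(grad, w, v, eps=1e-4):
    """Approximate H(w) v from two gradient evaluations, never forming H."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad(w_plus), grad(w_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

# At w = (1, 2) the Hessian of the toy objective is [[4, 2], [2, 6]],
# so the product with v = (1, 0) should be close to (4, 2).
hv = hessian_vector_product(grad_f, [1.0, 2.0], [1.0, 0.0])
```

An exact operator avoids the step-size sensitivity of the finite difference, which is one reason to prefer it in practice.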
Here, we present two versions of our algorithm: an exact gradient descent algorithm (Algorithm 1) and a batch stochastic gradient descent algorithm (Algorithm 2). The first is ideal, but impractical; hence, we need a more practical stochastic algorithm. Here, for simplicity, we hide the dependencies (where is the value of weights at iteration ) for many of the variables by defining: , , , etc. Step size is some predefined function of iteration .
We start with Algorithm 1, a gradient descent algorithm with exact values for the eigenvectors and spectral radius. In practice, these values are often difficult to compute exactly due to computational and storage constraints. For our stochastic algorithm, we relax this exact constraint and compute or approximate values based on a batch (rather than the full data set). We examine the convergence properties of Algorithm 1 in Section 4.1.
Building on the exact gradient descent algorithm, we develop a more practical stochastic algorithm. Algorithm 2 uses batch stochastic gradient descent (rather than gradient descent) and power iteration to compute the eigenvector (rather than an exact method). Because the differential operators of Section 3.1.1 never form the Hessian explicitly, the storage requirements are not extensive. Also, since the Hessian is symmetric, the power iteration converges at a rate proportional to the square of the ratio between the two largest eigenvalues in absolute value, $|\lambda_2 / \lambda_1|^2$, instead of the typical rate $|\lambda_2 / \lambda_1|$. The convergence properties of Algorithm 2 are studied in Section 4.2.
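The power iteration step can be sketched as follows. This is a minimal illustration, assuming only a callable that returns Hessian-vector products; the function name, the deterministic starting vector, and the stopping rule based on the Rayleigh quotient are our choices, not details taken from Algorithm 2.

```python
import math

def power_iteration(hvp, dim, tol=1e-10, max_iter=1000):
    """Estimate the spectral radius and its eigenvector using only products H v."""
    v = [1.0 / math.sqrt(dim)] * dim  # deterministic start; a random start is more usual
    rayleigh = 0.0
    for _ in range(max_iter):
        hv = hvp(v)
        # Rayleigh quotient v^T H v, valid because v has unit norm.
        new_rayleigh = sum(vi * hvi for vi, hvi in zip(v, hv))
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            rayleigh = new_rayleigh
            break  # v lies in the null space of H
        v = [x / norm for x in hv]
        converged = abs(new_rayleigh - rayleigh) < tol
        rayleigh = new_rayleigh
        if converged:
            break
    # The spectral radius is the absolute value of the dominant eigenvalue.
    return abs(rayleigh), v

def toy_hvp(v):
    """Product with the fixed symmetric matrix [[4, 2], [2, 6]] (illustrative only)."""
    return [4.0 * v[0] + 2.0 * v[1], 2.0 * v[0] + 6.0 * v[1]]

rho, vec = power_iteration(toy_hvp, dim=2)
```

Taking the absolute value of the Rayleigh quotient means the same routine handles a dominant eigenvalue that is negative, where the iterate flips sign at every step but the quotient still settles.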
First, in Section 4.1, we prove that the exact gradient descent algorithm (Algorithm 1) converges to a critical point. Second, in Section 4.2, we show that the stochastic gradient descent algorithm (Algorithm 2) almost surely converges to a critical point.
4.1 Gradient Descent Convergence
Here, we show that Algorithm 1 converges to a critical point under some assumptions on the objective function and learning rate . Specifically, we assume , , bounded from below, has Lipschitz gradient, and has Lipschitz continuous third derivative tensor. The learning rate is sufficiently small; specifically, , where is finite.
Theorem 4.1. Given the above assumptions, Algorithm 1 converges to a critical point.
We use these assumptions and Taylor’s theorem to show that the objective decreases until it reaches a critical point. Since the objective is bounded from below, this implies the algorithm converges. We provide more details in Appendix B.1.
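The descent argument can be illustrated with the standard bound for a function whose gradient is Lipschitz continuous. As a sketch, write $g$ for the regularized objective, $\alpha$ for the step size, and $L$ for a Lipschitz constant of $\nabla g$; these symbols are our notation for illustration, and the constant used in Appendix B.1 is built from the paper's own assumptions and may differ.

```latex
\begin{align*}
w_{t+1} &= w_t - \alpha \nabla g(w_t), \\
g(w_{t+1}) &\le g(w_t) - \alpha \|\nabla g(w_t)\|^2 + \frac{L \alpha^2}{2} \|\nabla g(w_t)\|^2 \\
&= g(w_t) - \alpha \left(1 - \frac{L \alpha}{2}\right) \|\nabla g(w_t)\|^2 .
\end{align*}
```

For $0 < \alpha < 2/L$, the right-hand side is strictly smaller than $g(w_t)$ unless $\nabla g(w_t) = 0$, so each step either decreases the objective or is already at a critical point.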
4.2 Stochastic Gradient Descent Convergence
Here, we show that Algorithm 2 almost surely converges to a critical point, with some assumptions on objective function and learning rate .
Assume that 1) , , is bounded from below (without loss of generality, ); 2) typical conditions on the learning rate:
3) the second, third, and fourth moments of the update term do not grow too quickly (in reality, this is usually true):
where is the update term, an approximation of computed on a single sample or batch of samples; 4) and outside a certain horizon, the gradient points towards the origin:
There are well-known tricks to ensure this assumption, such as adding a small linear term (which does not affect the Hessian-part of our algorithm).
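For orientation, conditions of this kind are usually written as follows in Bottou-style analyses of stochastic gradient descent. The symbols below ($\gamma_t$ for the learning rate, $u_t$ for the update term, $g$ for the regularized objective, and the constants $A_k$, $B_k$, $D$) are our assumed notation, and the paper's exact statements may differ.

```latex
\begin{align*}
&\text{(2)}\quad \sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \sum_{t=1}^{\infty} \gamma_t^2 < \infty, \\
&\text{(3)}\quad \mathbb{E}\,\|u_t\|^k \le A_k + B_k \|w_t\|^k \quad \text{for } k = 2, 3, 4, \\
&\text{(4)}\quad \inf_{\|w\|^2 > D} \; w^\top \nabla g(w) > 0 .
\end{align*}
```

A learning rate such as $\gamma_t = 1/t$ satisfies condition (2), since the harmonic series diverges while the sum of its squares converges.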
4.2.2 Confinement and Convergence
First, we show that given our assumptions, the iterates are confined.
We define a sequence that is a function of and show that the sum of its positive expectations is finite. Then, we apply the Quasi-Martingale Convergence Theorem and show that since the sequence converges almost surely, the norm of our weights is bounded. This also implies all continuous functions of are bounded. Next, using our assumptions and Lemma 4.1, we prove almost sure convergence.
We use confinement of to show that positive expected variations in between iterates are bounded by a constant times our learning rate squared. Using Assumption 2 and the Quasi-Martingale Convergence Theorem, we show that converges almost surely. Then, we show that almost surely converges to zero. Our proof is based on the proof by Bottou [1998] that stochastic gradient descent converges almost surely. We provide more detailed proofs of Lemma 4.1 and Theorem 4.2 in Appendix B.2.
4.2.3 Power Iteration
Here, we show that with certain additional assumptions, the power iteration and its convergence criteria fit our earlier assumptions for Algorithm 2 to converge almost surely. We start by showing that the power iteration fits our bounds on the moments of the update term. 1) Assume:
for , and where is the true eigenvector of . 2) We also assume that the Hessian is Lipschitz continuous. 3) The Power iteration algorithm converges to eigenvalues with the following condition:
where is the computed eigenvector. 4) We also assume as , as discussed later (see Lemma 4.3). 5) We define , and assume:
Given the above assumptions,
for some positive constants .
We split into components in terms of and the regularization term. We use our above assumptions to bound each of these components. Then, we combine the results to show the lemma holds. Lastly, we discuss constraints on , and show that must decrease to 0. We additionally assume is unbiased, i.e., and that as .
Given the above assumptions, , where is the true gradient.
We tested our spectral radius regularization algorithm on the following data sets: forest cover type [Blackard and Dean, 1999], United States Postal Service (USPS) handwritten digits [LeCun et al., 1990], and chest X-ray [Wang et al., 2017]. The forest cover type data uses cartographic data to predict which of seven tree species is contained in a plot of land. The USPS digits data includes images from scanned envelopes, with the goal of identifying which digit 0-9 each image corresponds to. The chest X-ray data consists of images of patients’ chest X-rays, with the goal of identifying which of fourteen lung diseases each patient has. We describe the data sets in further detail in Appendix C.1.
Additionally, we trained unregularized, Entropy-SGD [Chaudhari et al., 2016], K-FAC [Martens and Grosse, 2015], and asymmetric valley [He et al., 2019] models, which serve as baseline comparisons for our models.
In order to test whether models with a lower spectral radius generalize better than those with a higher spectral radius, we created test sets that differ from the training set. The methods used are covariate shifts, image augmentation, and the introduction of new, different data. For each model, we measured the spectral radius, estimated on a random batch of the training set.
For the forest cover type data, we weighted the test samples in order to shift the mean of the features. We then compared the accuracy of our trained models on the weighted test set, and repeated this procedure one thousand times. These perturbations could simulate test conditions with poor measurements and/or changes in climate.
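One way to realize a mean shift by reweighting is exponential tilting of the test samples. The following is a minimal sketch under that assumption; the actual weighting scheme is described in Appendix C.2 and may differ, and every name and number below is hypothetical.

```python
import math

def tilt_weights(features, theta):
    """Weights proportional to exp(theta . x), normalized to sum to 1."""
    scores = [sum(t * x for t, x in zip(theta, row)) for row in features]
    top = max(scores)  # subtract the max for numerical stability
    raw = [math.exp(s - top) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_mean(features, weights, column):
    """Mean of one feature column under the given sample weights."""
    return sum(w * row[column] for w, row in zip(weights, features))

def weighted_accuracy(correct, weights):
    """Accuracy where each test sample counts in proportion to its weight."""
    return sum(w for w, ok in zip(weights, correct) if ok)

# Made-up test set: two features per sample and a per-sample correctness flag.
features = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
correct = [True, True, False, True]
uniform = [0.25] * 4
shifted = tilt_weights(features, theta=[0.5, 0.0])
```

A positive entry in `theta` moves the weighted mean of that feature upward, and drawing `theta` at random would give a different shift on each repetition.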
For the USPS handwritten digits data, we augmented the test set using random crops and rotations. These crops and rotations could simulate test conditions where digits are written at an angle or poorly scanned.
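A random crop followed by a random rotation can be sketched in a few lines. This is an illustration only: the crop size, the angle range, the nearest-neighbor sampling, and all function names are assumptions, not the settings used for the augmented test sets.

```python
import math
import random

def rotate(image, degrees, fill=0.0):
    """Rotate a 2-D list image about its center with nearest-neighbor sampling."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel back into the source image.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out

def random_crop(image, size, rng, fill=0.0):
    """Cut a random size x size window and paste it centered on a blank canvas."""
    h, w = len(image), len(image[0])
    top, left = rng.randint(0, h - size), rng.randint(0, w - size)
    out = [[fill] * w for _ in range(h)]
    oy, ox = (h - size) // 2, (w - size) // 2
    for y in range(size):
        for x in range(size):
            out[oy + y][ox + x] = image[top + y][left + x]
    return out

def augment(image, rng, max_degrees=15.0, crop_size=None):
    """Apply a random crop, then a random rotation within +/- max_degrees."""
    crop_size = crop_size or len(image) - 2
    angle = rng.uniform(-max_degrees, max_degrees)
    return rotate(random_crop(image, crop_size, rng), angle)
```

Passing a seeded `random.Random` instance as `rng` keeps the augmented test set reproducible across models, so every model is scored on the same perturbed images.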
For the chest X-ray data, we compared performance on two similar data sets, CheXpert [Irvin et al., 2019] and MIMIC-CXR [Johnson et al., 2019] (using the six conditions common to the three data sets). We kept the labeled training and validation data sets separate for each data set, as there are differences in labeling. As this new chest X-ray data contains patients with conditions not present in the training data, this tests how well the trained models perform on a larger segment of the population.
We give a more detailed description of our methodology in Appendix C.2.
|Test||2-sided p-values for slope of accuracy vs. -norm of shifts|
We trained a feed-forward neural network on the forest cover type data, using different values of the two regularization parameters, and compared the accuracy on the randomly shifted test set to that of the unregularized, Entropy-SGD, K-FAC, and asymmetric valley models. Figure 1 shows that, as the norm of the shifts increases, the regularized models and the comparison models have a benefit over the unregularized model, but there is no significant benefit for the K-FAC model. However, there is variability in the different shifts, since all values are less than . While these plots help visualize our comparisons, Table 2 gives a more detailed picture: while the unregularized model is more accurate on the unshifted test data, it performs significantly worse as the norm of the shifts increases. On the other hand, the regularized, Entropy-SGD, and asymmetric valley models perform about the same regardless of these shifts. The slope of the trend line comparing the accuracy to the norm of the shifts is not significantly different from 0 for the regularized models, but is significantly negative for the unregularized model (at a significance level of 0.05). While there are some differences between the various regularized models (there is some delineation between those with lower and higher ), all perform better than the unregularized model. The regularized models have a significantly larger slope than the unregularized model, and the order of the slopes (from high to low) follows the models’ regularization strictness. Additionally, our spectral radius measure largely follows the regularization strictness. In terms of the regression slope, our strictest regularized models with and small are the best and third-best performing models. Entropy-SGD is the second-best performing model, although the difference between it and the prior two models is insignificant.
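The slope comparison above rests on an ordinary least-squares fit of accuracy against the norm of the shift. A minimal sketch of that fit follows; the data are made up for illustration and the function name is hypothetical. Converting the t-statistic into a two-sided p-value needs the Student t distribution with n - 2 degrees of freedom, which a statistics library such as SciPy (`scipy.stats.linregress`) reports directly.

```python
import math

def slope_t_statistic(x, y):
    """Least-squares slope of y on x, its standard error, and the t-statistic for slope = 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope with n - 2 residual degrees of freedom.
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, std_err, t_stat

# Made-up numbers: accuracy falling as the shift norm grows.
shift_norms = [0.0, 0.5, 1.0, 1.5, 2.0]
accuracies = [0.90, 0.88, 0.87, 0.84, 0.82]
slope, std_err, t_stat = slope_t_statistic(shift_norms, accuracies)
```

A slope that is not significantly different from zero corresponds to a model whose accuracy is stable under the shifts, which is the behavior reported for the regularized models.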
|Model||Test||AT 1||AT 2|
We trained a convolutional neural network on the USPS data with no regularization; with the Entropy-SGD, K-FAC, and asymmetric valley methods; and with various values of the two regularization parameters, comparing the accuracy on both the test and augmented test data sets. Per Table 3, while the models performed comparably on the test data (all models had an accuracy of 94.47-95.91%), our regularized models performed significantly better than the unregularized model on both augmented test data sets (87.54-90.93% vs. 85.50% on Augmented Test 1; 66.17-69.61% vs. 63.43% on Augmented Test 2). On all test sets, the regularized model with and was the most accurate. Of our comparison models, the asymmetric valleys model performed best, only 1.5% and 2.6% worse than our best model on the augmented test sets. The Entropy-SGD model had an accuracy comparable to our regularized models on Augmented Test 1 (only 1.9% worse than our best model), but it performed worse on the trickier Augmented Test 2 (6.5% worse than our best model, though still better than the unregularized model). The K-FAC model performed worse on both augmented test sets: 5.3% worse than our best model on Augmented Test 1 and 11.0% worse on Augmented Test 2.
For the chest X-ray comparisons, we started with the Zoogzog implementation of CheXNet [Rajpurkar et al., 2017], a 121-layer DenseNet trained on the chest X-ray data set, as our baseline. Using this pre-trained model as the initialization, we trained for an additional epoch with our spectral radius regularization method, comparing the mean area under the receiver operating characteristic curve (AUC) over the 14 classes (or 6 for our comparison tests). Similarly, we used this initialization and trained for one additional epoch using the Entropy-SGD and K-FAC optimization methods. The results are displayed in Table 4. Since we measured the spectral radius on a random batch, our batch size of four was too small to get a meaningful estimate for these models. Also, our strictest regularized model () appeared to be too strictly regularized, as its test mean AUC is significantly worse than that of the other models. While the other regularized models performed worse in terms of test mean AUC, our regularized models performed as well or better on most of the CheXpert and MIMIC-CXR data sets, with the second-most strictly regularized model () performing the best on these four test sets. In fact, it outperformed the original model by 4.4% and 3.8%, respectively, on the CheXpert and MIMIC-CXR “validation” sets, and by 0.6% and 2.1% on the “training” sets, implying the regularized model generalizes better than the original model. K-FAC slightly outperformed this model on the MIMIC-CXR training set (by 0.4%), but performed up to 4.0% worse on the other sets. The Entropy-SGD model performed 5.9-10.2% worse than our regularized model on the CheXpert and MIMIC-CXR sets.
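The mean AUC used in this comparison can be computed per condition and then averaged. The sketch below uses the pairwise-ranking definition of AUC, which is quadratic in the number of samples and meant only as an illustration; the function names and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            # Ties count as half a correctly ranked pair.
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(positives) * len(negatives))

def mean_auc(scores_by_class, labels_by_class):
    """Unweighted mean of per-class AUCs, one binary problem per condition."""
    aucs = [auc(s, y) for s, y in zip(scores_by_class, labels_by_class)]
    return sum(aucs) / len(aucs)

# Toy example with two conditions and four patients.
scores = [[0.1, 0.4, 0.35, 0.8], [0.2, 0.1, 0.9, 0.7]]
labels = [[0, 0, 1, 1], [0, 0, 1, 1]]
result = mean_auc(scores, labels)
```

Averaging over only the conditions shared between data sets, as done for the CheXpert and MIMIC-CXR comparisons, amounts to passing the corresponding subset of per-class lists.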
We created algorithms for regularized optimization of machine learning models in order to find flat minima. Furthermore, we developed tools for calculating the regularization term and its gradient for neural networks. We proved that these methods converge (or almost surely do) to a critical point. Then, we showed that the regularization works on a range of applicable problems which required us to design different methods to test generalizability for each of these problems.
- van der Aa et al. Computation of eigenvalue and eigenvector derivatives for a general complex-valued eigensystem. Electronic Journal of Linear Algebra 16, pp. 300–314.
- Blackard and Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24, pp. 131–151.
- Chaudhari et al. Entropy-SGD: biasing gradient descent into wide valleys. CoRR abs/1611.01838.
- He et al. Asymmetric valleys: beyond sharp and flat local minima. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 2553–2564.
- Irvin et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031.
- Jastrzebski et al. Finding flatter minima with SGD.
- Johnson et al. MIMIC-CXR: A large publicly available database of labeled chest radiographs. CoRR abs/1901.07042.
- Keskar et al. On large-batch training for deep learning: generalization gap and sharp minima. CoRR abs/1609.04836.
- LeCun et al. Handwritten zip code recognition with multilayer networks. Proceedings - International Conference on Pattern Recognition, Vol. 2, pp. 35–40.
- Ma et al. Inefficiency of K-FAC for large batch size training. CoRR abs/1903.06237.
- Martens and Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. CoRR abs/1503.05671.
- Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation 6, pp. 147–160.
- Rajpurkar et al. CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR abs/1711.05225.
- Wang et al. ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471.
- Yao et al. Hessian-based analysis of large batch training and robustness to adversaries.
- Yoshida and Miyato. Spectral norm regularization for improving the generalizability of deep learning.
Appendix A Hessian-Vector Operations Derivation
The forward computation for each layer of a network with input , output , weights , activation , bias , error or objective measure , and direct derivative , is given by:
The backward computation:
Applying to forward pass:
The backwards computation follows as:
This yields the result found in Pearlmutter [1994]. However, we extend it one step further by applying again, i.e., applying to the original forward pass:
The backwards computation follows as:
The original formulation allows us to efficiently compute , which can be used to compute and/or estimate the eigenvector corresponding to the spectral radius (via power iteration). The extended formulation, in turn, allows us to efficiently compute and thus . This enables us to efficiently compute the gradient of our optimization problem for use in gradient descent methods.
Appendix B Analysis Proofs
B.1 Gradient Descent Convergence
Proof that the gradient descent algorithm converges (Theorem 4.1).
Since is Lipschitz continuous (i.e., ), . And similarly, since is Lipschitz continuous (i.e., ), . We define: .
Since our update step is , and by applying Taylor’s Theorem:
for some , where . Applying the inequality derived from Lipschitz continuity:
Given our assumption ,
Either or . If the first, then the algorithm has converged to a critical point. If the second, then this step will decrease the value of the objective function. Since the objective function is bounded from below, it cannot decrease in perpetuum. Thus, it must eventually converge to a critical point. ∎
B.2 Stochastic Gradient Descent Convergence
B.2.1 Confinement and Convergence
Proof of confinement (Lemma 4.1):
Definition B.1 implies that:
for . Note that this becomes an equality when .
Applying this to ,
By the Cauchy-Schwarz inequality,
Taking the expectation,
Due to Assumption 3, there exist positive constants such that
and thus there exist positive constants such that
If , then , then the first term on the right hand side is zero. And if , by Assumption 4, the first term of the right hand side is negative. Therefore,
We then transform the expectation inequality to
We define the sequences as follows:
Note that (this can be shown by writing and using the condition on the sum of the squared learning rate). By substituting these sequences into the above inequality, we obtain
By defining , for some process , we can bound the positive expected variations of :
Due to Assumption 2, the sum of this expectation is finite. The Quasi-Martingale Convergence Theorem states:
This implies that converges almost surely. And, since converges to , converges almost surely.
Assume converges to a value . For sufficiently large, . This implies that the above inequality is an equality, which then implies
But since and (for sufficiently large), this result is not compatible with Assumption 4. Because of this contradiction, we must conclude that converges to 0.
Since converges to 0, the norm and parameters are bounded. This also means that all continuous functions of are likewise bounded. ∎
Proof that SGD converges almost surely (Theorem 4.2):
We can bound variations of loss/cost criteria using a first order Taylor expansion and bounding the second derivatives with .
which can be rewritten as:
Taking the expectation,
Bounding the expectation , yields
Therefore, the positive expected variations are bounded by
By the Quasi-Martingale Convergence Theorem, converges almost surely,
Additionally, taking the above and summing on , implies the convergence of the following series:
We define , the variations of which are bounded using the Taylor expansion, similarly to the variations of :
for some constant . Taking the expectation and bounding the second derivative by , yields
The positive expectations are bounded,
Since the terms on the right hand side are summands of convergent infinite sequences (by the above and the Assumption 3), by the Quasi-Martingale Convergence Theorem, converges almost surely. And since the above sequence converges almost surely, this implies the limit must be zero:
B.2.2 Power Iteration
Proof that power iteration follows the assumed update steps (Lemma 4.2):
For ease, we define . We begin by splitting into its components:
By the definition of and the triangle inequality,
Taking the expectation and applying the Cauchy-Schwarz inequality yields:
Note . Similarly,