1 Introduction
Large neural networks are surprisingly easy to optimize [46], despite the substantial nonconvexity of the loss as a function of the parameters [19]. In particular, it is usually found that changing the random initialization has no effect on performance, even though it can change the model learned by gradient-based optimization methods [17]. Understanding the cause of trainability from random initial conditions is critical for the development of new architectures and optimization methods, which must otherwise simply hope to retain this favorable property on the basis of heuristics.
One possible explanation for this phenomenon is based on the stationary points of gradient-based optimization methods. These methods are stationary when the gradient of the loss function is 0, i.e., at the critical points of the loss. Critical points are classified by their Morse index, or the degree of local negative curvature (i.e., the relative number of dimensions in parameter space along which the curvature is negative). Since, among all critical points, gradient descent methods only converge to those with index 0 [29], which includes local minima, it has been argued that large neural networks are easy to train because, for many problems, their loss functions only have local minima at values of the loss close to or at the global optimum. This is known as the “no-bad-local-minima” property. Previous work [11, 39] reported numerical evidence for a convex relationship between index and loss that supports the hypothesis that neural network loss functions have the no-bad-local-minima property: for low values of the loss, only low values of the index were observed, whereas for high values of the loss, only high values of the index were observed. However, more recent theoretical work has indicated that there are in fact bad local minima on neural network losses in almost all cases [12]. The validity of the numerical results depends on the validity of the critical point-finding algorithms, and the second-order critical point-finding algorithms used in [11] and [39] are not in fact guaranteed to find critical points when the Hessian is singular. In that case, the second-order information used by these methods becomes unreliable.
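As a concrete illustration of the Morse index (our own sketch, not code from the cited works), the index of a critical point can be computed from the eigenvalues of the Hessian there as the fraction that are negative:

```python
import numpy as np

def morse_index(hessian, tol=1e-8):
    """Fraction of Hessian eigenvalues that are negative.

    Index 0 corresponds to a local minimum, index 1 to a local maximum.
    `tol` guards against counting tiny numerical eigenvalues as negative.
    """
    eigvals = np.linalg.eigvalsh(hessian)  # symmetric matrix: real spectrum
    return np.mean(eigvals < -tol)

# A saddle of f(x, y) = x^2 - y^2 at the origin has Hessian diag(2, -2):
# one of two curvature directions is negative, so the index is 0.5.
print(morse_index(np.diag([2.0, -2.0])))  # -> 0.5
```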
Neural network loss Hessians are typically highly singular [44], and poor behavior of Newton-type critical point-finding methods has been reported in the neural network case [10], casting doubt on the completeness and accuracy of the results in [11] and [39]. [16] verified that second-order methods can in fact find high-quality approximate critical points for linear neural networks, for which the analytical form of the critical points is known [2], providing ground truth. In particular, the two-phase convergence pattern predicted by the classical analysis of Newton methods [36] is evident: a linear phase followed by a short, local supralinear phase (Fig. 1A). The supralinear convergence is visible in the “cliffs” in the blue traces in Fig. 1A, where the convergence rate suddenly improves. With a sufficiently strict cutoff on the gradient norms, the correct loss-index relationship obtained analytically (Fig. 1B, gray points) is shared by the points obtained numerically (Fig. 1B, light blue points). With an insufficiently strict cutoff, the loss-index relationship implied by the observed points is far from the truth (Fig. 1B, dark red points).
Unfortunately, good performance on linear networks does not guarantee good performance on nonlinear networks. When applied to a nonlinear network, even with the same data, the behavior of these Newton methods changes dramatically for the worse (Fig. 1C). No runs exhibit supralinear convergence, and the gradient norms at termination are many orders of magnitude larger. These are not the signatures of a method converging to a critical point, even though the gradient norms are sometimes still under the thresholds reported in [39] and [16] (no threshold is reported in [11]). This makes it difficult to determine whether the putative loss-index relationship measured from these points (Fig. 1D) accurately reflects the loss-index relationship at the true critical points of the loss function.
In this paper, we identify a major cause of this failure for second-order critical point-finding methods: gradient-flat regions, where the gradient is approximately in the kernel of the Hessian. In these regions, the loss function is locally approximately linear along the direction of the gradient, whether or not the gradient is itself small, as it would be near a true critical point. We first define gradient-flatness (Sec. 2.1) and explain, with a low-dimensional example (Sec. 2.2), why it is problematic for second-order methods: gradient-flat points can be “bad local minima” for the problem of finding critical points. We then provide evidence that gradient-flat regions are encountered when applying the Newton-MR algorithm to a deep neural network loss (Sec. 3, Appendix A.5). Furthermore, we show that, though gradient-flat regions need not contain actual critical points, the loss-index relationship looks strikingly similar to that reported in [11] and [39], suggesting that these previous studies may have found gradient-flat regions, not critical points. Finally, we note the implications of gradient-flatness for the design of second-order methods for use in optimizing neural networks: in the presence of gradient-flatness, approximate second-order methods, like K-FAC [32] and Adam [26], may be preferable to exact second-order methods even without taking computational cost into account.
2 Gradient-Flat Points are Stationary Points for Second-Order Methods
In this section, we introduce and define gradient-flat points and explain why they are problematic for second-order critical point-finding methods, with the help of a low-dimensional example to build intuition. In numerical settings and in high dimensions, approximately gradient-flat points are also important, and so we define a quantitative index of gradient-flatness based on the residual norm of the Newton update. Connected sets of these numerically gradient-flat points are gradient-flat regions, which cause trouble for second-order critical point-finding methods.
2.1 At Gradient-Flat Points, the Gradient Lies in the Hessian's Kernel
Critical points are of interest because they are points where the first-order approximation of a function $f$ at a point $\theta + \delta$, based on the local information at $\theta$ (note that, for a neural network loss function, the variable we take the gradient with respect to, here $\theta$, is the vector of parameters, not the data, which is often denoted with an $x$),

$$f(\theta + \delta) \approx f(\theta) + \nabla f(\theta)^\top \delta \qquad (1)$$
is constant, indicating that they are the stationary points of first-order optimization algorithms like gradient descent and its accelerated variants. By a stationary point, we mean a point at which the proposed updates of an optimization algorithm are zero. Stationary points need not be points to which the algorithm converges. For example, gradient descent almost surely converges only to critical points with index 0 [29, 23], even though its stationary points include saddle points and maxima. However, the curvature properties of non-minimal stationary points show up in proofs of the convergence rates of first-order optimizers, e.g. [23, 25, 24], and so knowledge of their properties can guide algorithm design.
In searching for critical points, it is common to use a linear approximation to the behavior of the gradient at a point $\theta + \delta$, given the local information at a point $\theta$:

$$\nabla f(\theta + \delta) \approx \nabla f(\theta) + H(\theta)\,\delta \qquad (2)$$
Because these methods rely on a quadratic approximation of the original function $f$, represented by the Hessian matrix $H$ of second partial derivatives, we call them second-order critical point-finding methods.
The approximation on the right-hand side is constant whenever $\delta$ is an element of $\ker(H)$, where $\ker$ is notation for the kernel of a matrix, i.e., the subspace that it maps to 0. When $H$ is nonsingular, this is only satisfied when $\delta$ is 0, so if we can define an update rule such that $\delta = 0$ iff $\nabla f = 0$, then, for nonsingular Hessians, we can be sure that our method is stationary only at critical points.
In a Newton-type method, we achieve this by selecting our step $\delta$ by solving for the zeroes of this linear approximation, i.e., the Newton system

$$H(\theta)\,\delta = -\nabla f(\theta),$$

which has solution

$$\delta = -H(\theta)^{+}\,\nabla f(\theta),$$

where the matrix $H^{+}$ is the Moore-Penrose pseudoinverse of the matrix $H$, obtained by performing the singular value decomposition, inverting the nonzero singular values, and recomposing the SVD matrices in reverse order. The Newton update $\delta$ is zero iff $\nabla f$ is 0 for a nonsingular Hessian, for which the pseudoinverse is simply the inverse. For a singular Hessian, the update is zero iff $\nabla f$ is in the kernel of the pseudoinverse. Note that if the Hessian is constant as a function of $\theta$, the linear model of the gradient is exact and this algorithm converges in a single step.

Within the vicinity of a critical point, this algorithm converges extremely quickly [36], but the guarantee of convergence is strictly local. Practical Newton methods in both convex optimization [5] and nonlinear equation solving [36, 22] often compare multiple possible choices of $\delta$ and select the best one according to a “merit function” applied to the gradients, which has a global minimum at each critical point. Such algorithms have broader guarantees of global convergence. A common choice of merit function is the squared norm,

$$m(\theta) = \tfrac{1}{2}\,\|\nabla f(\theta)\|^{2}.$$
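The stationarity condition above is easy to see numerically. A minimal numpy sketch (an illustration, not the implementation used in the experiments) computes the pseudoinverse Newton step and shows that it vanishes for a nonzero gradient lying in the Hessian's kernel:

```python
import numpy as np

def newton_step(hessian, grad):
    """Newton update delta = -H^+ g, via the Moore-Penrose pseudoinverse."""
    return -np.linalg.pinv(hessian) @ grad

# Nonsingular case: the step is zero iff the gradient is zero.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
g = np.array([4.0, 1.0])
print(newton_step(H, g))  # a full Newton step, [-2., -1.]

# Singular case: a nonzero gradient lying entirely in ker(H) yields a zero
# step, so the method is stationary at this non-critical point.
H_sing = np.array([[2.0, 0.0], [0.0, 0.0]])   # kernel spanned by e_y
g_flat = np.array([0.0, 3.0])                  # entirely in the kernel
print(newton_step(H_sing, g_flat))             # zero step despite ||g|| > 0
```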
In gradient norm minimization [33], we optimize this merit function directly. The gradient of the merit function is

$$\nabla m(\theta) = H(\theta)\,\nabla f(\theta).$$
As with Newton methods, in the invertible case the updates are zero iff $\nabla f$ is 0. In the singular case, the updates are zero iff the gradient is in the Hessian's kernel. Because this method is framed as the minimization of a scalar function, it is compatible with first-order optimization methods, which are more commonly implemented and better supported in neural network libraries.
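The same stationarity can be checked for gradient norm minimization; a small numpy sketch with a stand-in Hessian and gradient (our illustration, not the paper's code):

```python
import numpy as np

def merit_grad(hessian, grad):
    """Gradient of m(theta) = 0.5 * ||grad f(theta)||^2, which equals H g
    for a symmetric Hessian H."""
    return hessian @ grad

H = np.array([[1.0, 0.0], [0.0, 0.0]])  # singular: kernel spanned by e_y

print(merit_grad(H, np.array([2.0, 0.0])))  # gradient outside the kernel:
                                            # nonzero update, method moves
print(merit_grad(H, np.array([0.0, 5.0])))  # nonzero gradient inside the
                                            # kernel: zero update, so the
                                            # method is stationary at a
                                            # point that is not critical
```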
However, neural network Hessians are generally singular, especially in the overparameterized case [44, 18], meaning the kernel is nontrivial, and so neither class of methods can guarantee convergence to critical points. Newton's method can diverge, oscillate, or behave chaotically [20]. The addition of merit-function-based upgrades can remove these behaviors, but it cannot guarantee convergence to critical points [41, 20]. The gradient norm minimization method described in [39] was previously proposed, and this flaw pointed out, twice in the field of chemical physics: once in the 1970s (proposed in [33], critiqued in [8]) and again in the 2000s (proposed simultaneously in [1, 6], critiqued in [13]).
What are the stationary points, besides critical points, of these two method classes in the case of singular Hessians? It would seem at first that they are different: for gradient norm minimization, points where the gradient is in the Hessian's kernel; for Newton-type methods, points where the gradient is in the kernel of the Hessian's pseudoinverse. In fact, however, these conditions are identical, due to the Hessian's symmetry (the kernel of the pseudoinverse is equal to the kernel of the transpose, as can be seen from the singular value decomposition, and the Hessian is equal to its transpose because it is symmetric; see [45]), and so both algorithms share a broad class of stationary points.
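This equivalence can be verified numerically; a small sketch with a hypothetical singular symmetric matrix:

```python
import numpy as np

# A symmetric singular matrix: eigenvalue 3 along u, eigenvalue 0 along v.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
v = np.array([1.0, -1.0]) / np.sqrt(2.0)
H = 3.0 * np.outer(u, u)

H_pinv = np.linalg.pinv(H)

# v lies in the kernel of both H and its pseudoinverse ...
assert np.allclose(H @ v, 0.0) and np.allclose(H_pinv @ v, 0.0)
# ... so the stationarity conditions of the two method classes coincide.
print("ker(H) and ker(H^+) agree on v")
```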
These stationary points have been identified previously, but the nomenclature is not standard: Doye and Wales, studying gradient norm minimization, call them non-stationary points [13], since they are non-stationary with respect to the function $f$, while Byrd et al., studying Newton methods, call them stationary points [7], since they are stationary with respect to the merit function $m$. To avoid confusion between these incommensurate conventions, or with the stationary points of the function $f$, we call a point where the gradient lies in the kernel of the Hessian a gradient-flat point. This name was chosen because a function is flat when its Hessian is 0, meaning every direction is in the kernel, and so it is locally flat around a point in a given direction whenever that direction is in the kernel of the Hessian at that point. Note that, because every matrix maps 0 to 0, every critical point is also a gradient-flat point, but the reverse is not true. When we wish to refer explicitly to gradient-flat points which are not critical points, we will call them strict gradient-flat points. At a strict gradient-flat point, the function is, along the direction of the gradient, locally linear up to second order.
There is an alternative view of gradient-flat points based on the squared gradient norm merit function. All gradient-flat points are stationary points of the gradient norm, which may in principle be local minima, maxima, or saddles, while the global minima of the gradient norm are critical points. When they are local minima of the gradient norm, they can be targets of convergence for methods that use first-order approximations of the gradient map, as in gradient norm minimization and in Newton-type methods. Strict gradient-flat points, then, can be “bad local minima” of the gradient norm, and can therefore prevent the convergence of second-order root-finding methods to critical points, just as bad local minima of the loss function can prevent the convergence of first-order optimization methods to global optima.
Note that Newton methods cannot be demonstrated to converge only to gradient-flat points [41]. Furthermore, Newton convergence can be substantially slowed when even a small fraction of the gradient is in the kernel [20]. Below we will see that, while a Newton method applied to a neural network loss sometimes converges to and almost always encounters strict gradient-flat points, the final iterate is not always either a strict gradient-flat point or a critical point.
2.2 Convergence to Gradient-Flat Points Occurs in a Low-Dimensional Quartic Example
The difficulties that gradient-flat points pose for Newton methods can be demonstrated with a polynomial example in two dimensions, plotted in Fig. 2A. Below, we characterize the strict gradient-flat (orange) and critical (blue) points of this function (Fig. 2A). Then, we observe the behavior of a practical Newton method applied to it (Fig. 2B, C) and note similarities to the results in Fig. 1. We will use this simple, low-dimensional example to demonstrate principles useful for understanding the results of applying second-order critical point-finding methods to more complex, higher-dimensional neural network losses.
As our model function, we choose
(3) 
It is plotted in Fig. 2A, middle panel. This quartic function has two affine subspaces of points with nontrivial Hessian kernel. The kernel is orthogonal to each of these affine subspaces at every point. Along one coordinate, the function is convex, with a one-dimensional set of minimizers. The strict gradient-flat points occur at the intersections of these two sets: one strict gradient-flat point that is a local minimum of the gradient norm, and one that is a saddle of the same (Fig. 2A, orange points, all panels). In the vicinity of these points, the gradient is, to first order, constant along the kernel direction, and so the function is locally linear, or flat. These points are gradient-flat, but neither is a critical point of the function. The only critical point is located at the minimum of the polynomial (Fig. 2A, blue point, all panels), which is also a global minimum of the gradient norm. One of these affine subspaces divides the space into two basins of attraction, loosely defined, for second-order methods: one for the critical point and the other for the strict gradient-flat point. Note that the vector field in the middle panel shows update directions for the pure Newton method, which can behave extremely poorly in the vicinity of singularities [41, 20], often oscillating and converging very slowly, or diverging.
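The stalling behavior at strict gradient-flat points can be reproduced on an even simpler function than Eqn. 3 (our own toy example, not the paper's): for $f(x, y) = x + y^2$, the Hessian kernel points along $x$ everywhere, and on the line $y = 0$ the gradient $(1, 0)$ lies entirely inside it.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([1.0, 2.0 * y])            # gradient of f(x, y) = x + y^2

def hess(p):
    return np.array([[0.0, 0.0], [0.0, 2.0]])  # singular: ker = span(e_x)

p = np.array([0.3, 1.7])
for _ in range(5):
    p = p - np.linalg.pinv(hess(p)) @ grad(p)  # pure pseudoinverse Newton step

# The first step lands on y = 0; after that the gradient (1, 0) is entirely
# in ker(H), the update is zero, and the iteration stalls at a strict
# gradient-flat point with gradient norm stuck at 1.
print(p, np.linalg.norm(grad(p)))
```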
Practical Newton methods use techniques like damping and line search to improve behavior [22]. To determine how a practical Newton method behaves on this function, we focus on the case of Newton-MR [43], which uses the MINRES-QLP [9] solver to compute the Newton update (MINRES-QLP is a Krylov subspace method akin to conjugate gradient, but specialized to the symmetric, indefinite, and ill-conditioned case, which makes it well-suited to this problem and to neural network losses) and backtracking line search with the squared gradient norm merit function to select the step size. Pseudocode for this algorithm is provided in Appendix A.3. This method was found in [16] to perform better than a damped Newton method and gradient norm minimization at finding the critical points of a linear autoencoder. Results are qualitatively similar for damped Newton methods with a squared gradient norm merit function.
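A heavily hedged schematic of one such iteration: numpy's minimum-norm least-squares solve stands in for the MINRES-QLP inner solver, and a plain decrease condition on the merit function stands in for the algorithm's actual acceptance rule (see Appendix A.3 for the real pseudocode):

```python
import numpy as np

def practical_newton_step(hess_fn, grad_fn, p, beta=0.5, max_backtracks=30):
    """One damped Newton iteration: minimum-norm least-squares solution of
    H s = -g (a dense stand-in for an iterative MINRES-QLP solve), then
    backtracking line search on the merit m = 0.5 * ||grad f||^2."""
    g = grad_fn(p)
    s, *_ = np.linalg.lstsq(hess_fn(p), -g, rcond=None)
    merit = 0.5 * g @ g
    alpha = 1.0
    for _ in range(max_backtracks):
        g_new = grad_fn(p + alpha * s)
        if 0.5 * g_new @ g_new < merit:  # accept on simple merit decrease
            break
        alpha *= beta
    return p + alpha * s

# On a well-behaved quadratic, one step solves the problem exactly:
p1 = practical_newton_step(lambda p: np.eye(2), lambda p: p,
                           np.array([3.0, 4.0]))
print(p1)  # -> [0. 0.]
```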
The results of applying Newton-MR to Eqn. 3 are shown in Fig. 2B. The gradient-flat point is attracting for some trajectories (orange), while the critical point is attracting for others (blue). For trajectories that approach the strict gradient-flat point, the gradient norm does not converge to 0, but instead converges to a nonzero value near 10 (orange trajectories; Fig. 2B, bottom panel). This value is typically several orders of magnitude lower than at the initial point, and so would appear to be close to 0 on a linear scale that includes the gradient norm of the initial point. Since log-scaling of loss functions is uncommon in machine learning, as losses do not always have minima at 0, second-order methods approaching gradient-flat points can appear to converge to critical points if typical methods for visually assessing convergence are used.
There are two interesting and atypical behaviors worth noting. First, the trajectories tend to oscillate in the vicinity of the gradient-flat point and converge more slowly (Fig. 2B, middle panel, orange lines). Updates from points close to the affine subspace where the Hessian has a kernel, and which therefore have an approximate kernel themselves, sometimes jump to points where the Hessian does not have an approximate kernel. This suggests that, when converging towards a gradient-flat point, the degree of flatness will change from iteration to iteration. Second, some trajectories begin in the nominal basin of attraction of the gradient-flat point but converge to the critical point (Fig. 2B, middle panel, blue points). This is because the combination of backtracking line search and large proposed step sizes means that, occasionally, very large steps can be taken, based on non-local features of the function. Indeed, backtracking line search is a limited form of global optimization, and the ability of line searches to change convergence behaviors predicted from local properties on nonconvex problems is known [36]. Since the backtracking line search is based on the gradient norm, the basin of attraction of the true critical point, which has a lower gradient norm than the gradient-flat point, is much enlarged relative to that of the gradient-flat point. This suggests that Newton methods using the gradient norm merit function will be biased towards finding gradient-flat points that also have low gradient norm.
2.3 Approximate Gradient-Flat Points and Gradient-Flat Regions
Analytical arguments focus on exactly gradient-flat points, where the Hessian has an exact kernel and the gradient lies entirely within it. In numerical settings, it is almost certain that no matrix will have an exact kernel, due to rounding error. For the same reason, the computed gradient vector will generically not lie entirely within the exact or approximate kernel. However, numerical implementations of second-order methods will struggle even when there is no exact kernel or when the gradient is only partly in it, and so a numerical index of flatness is required. This is analogous to the requirement to specify a tolerance for the norm of the gradient when deciding whether or not to consider a point an approximate critical point.
We quantify the degree of gradient-flatness of a point by means of the relative residual norm and the relative cokernel residual norm of the Newton update direction $\delta$. The vector $\delta$ is an inexact solution to the Newton system $H\delta = -\nabla f$, where $H$ and $\nabla f$ are the current iterate's Hessian and gradient. The residual is equal to $H\delta + \nabla f$, and the smaller its norm, the better $\delta$ is as a solution. The cokernel residual is equal to the Hessian times the residual, and so ignores any component of the residual in the kernel of the Hessian. Its norm quantifies the quality of an inexact Newton solution in the case where the gradient lies partly in the Hessian kernel, the unsatisfiable case in which $H\delta \neq -\nabla f$ for every $\delta$. When the residual is large but the cokernel residual is small (norms near 1 and 0, respectively, following suitable normalization), we are at a point where the gradient is almost entirely in the kernel of the Hessian: an approximate gradient-flat point. In the results below, we consider a point approximately gradient-flat when the relative cokernel residual norm is below 5e-4 while the relative residual norm is above its cutoff. See Appendix A.4 for definitions and details. We emphasize that numerical issues for second-order methods can arise even when the degree of gradient-flatness is small.
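A sketch of how these two quantities might be computed (the normalizations here are one plausible choice; the exact definitions live in Appendix A.4):

```python
import numpy as np

def flatness_residuals(H, g, delta):
    """Relative residual and cokernel-residual norms of an inexact Newton
    solution `delta` of H delta = -g."""
    r = H @ delta + g                 # residual of the Newton system
    r_co = H @ r                      # cokernel residual: removes any
                                      # component of r in ker(H)
    rel_r = np.linalg.norm(r) / np.linalg.norm(g)
    rel_r_co = np.linalg.norm(r_co) / np.linalg.norm(H @ g)
    return rel_r, rel_r_co

# A gradient lying mostly in ker(H): the Newton system is nearly
# unsatisfiable, so even the best solution leaves rel_r near 1 while the
# cokernel residual norm is near 0 -- an approximate gradient-flat point.
H = np.diag([2.0, 0.0])               # singular Hessian, ker = span(e_y)
g = np.array([0.1, 1.0])              # gradient mostly in the kernel
delta = -np.linalg.pinv(H) @ g        # minimum-norm inexact solution
print(flatness_residuals(H, g, delta))
```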
Under this relaxed definition of gradient-flatness, there will be a neighborhood of approximate gradient-flat points around any strict, exact gradient-flat point of a function with Lipschitz-smooth gradients and Hessians. Furthermore, there may be connected sets of non-null Lebesgue measure whose points all satisfy the approximate gradient-flatness condition but none of which satisfy the exact condition. We call both of these gradient-flat regions.
There are multiple reasonable numerical indices of flatness besides the definition above. For example, the Hessian-gradient regularity condition in [43], which is used to prove the convergence of Newton-MR, would suggest constructing a basis for the approximate kernel of the Hessian and projecting the gradient onto it. Alternatively, one could compute the Rayleigh quotient of the gradient with respect to the Hessian. Our index has the advantage of being computed as part of the Newton-MR algorithm, and it furthermore avoids diagonalizing the Hessian or specifying an arbitrary eigenvalue cutoff. The Rayleigh quotient can be computed with only one Hessian-vector product, plus several vector-vector products, so it might be a superior choice for larger problems where computing a high-quality inexact Newton step is computationally infeasible.
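The Rayleigh-quotient alternative needs only a single Hessian-vector product; a hypothetical sketch in which `hvp` would, in practice, be supplied by automatic differentiation:

```python
import numpy as np

def rayleigh_flatness(hvp, g):
    """Rayleigh quotient g^T H g / ||g||^2 of the gradient with respect to
    the Hessian, from one Hessian-vector product hvp(g). A value near 0,
    relative to the Hessian's overall scale, suggests the gradient lies
    mostly along directions of negligible curvature."""
    return g @ hvp(g) / (g @ g)

H = np.diag([2.0, 0.0])        # stand-in Hessian for illustration
hvp = lambda v: H @ v          # in practice: an autodiff Hessian-vector product

print(rayleigh_flatness(hvp, np.array([0.0, 1.0])))  # -> 0.0 (flat direction)
print(rayleigh_flatness(hvp, np.array([1.0, 0.0])))  # -> 2.0 (curved direction)
```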
3 Gradient-Flat Regions are Common on Deep Network Losses
To determine whether gradient-flat regions are responsible for the poor behavior of Newton methods on deep neural network (DNN) losses demonstrated in Fig. 1, we applied Newton-MR to the loss of a small, two-hidden-layer, fully-connected autoencoder trained on 10k MNIST images downsized to 4x4, similar to the downsized datasets in [11, 39]. We found similar results for a fully-connected classifier trained on the same MNIST images via the cross-entropy loss (see Appendix A.5) and for another classifier trained on a very small subset of 50 randomly-labeled MNIST images, as in [47] (see Appendix A.6). We focused on Newton-MR because we found that a damped Newton method like that in [11] performed poorly, as reported for the XOR problem in [10], and furthermore that there was insufficient detail to replicate [11] exactly. We denote the network losses by $f$ and the parameters by $\theta$. See Appendix A.1 for details on the networks and datasets and Appendix A.2 for details on the critical point-finding experiments.
Gradient norms for the first 100 iterations appear in Fig. 3A. As in the nonlinear autoencoder applied to the multivariate Gaussian data (Fig. 1C), we found that, after 500 iterations, all of the runs had squared gradient norms over 10 orders of magnitude greater than the typical values observed after convergence in the linear case (<1e-30, Fig. 1A). 14% of runs terminated with squared gradient norm below the cutoff in [16] and so found likely critical points (blue). Twice as many runs terminated above that cutoff but inside a gradient-flat region (28%, orange), while the remainder were above the cutoff and were not in a gradient-flat region at the final iterate (black).
The relative residual norm of the Newton solution is an index of gradient-flatness; see Sec. 2.3 and Appendix A.4 for details. Its value at every iteration of Newton-MR is shown for three representative traces in Fig. 3B. In the top trace (black), the relative residual norm stays close to 0, indicating that the iterates are not in a gradient-flat region. Newton methods can be substantially slowed when even a small fraction of the gradient is in the kernel [20] and can converge to points that are not gradient-flat [7]. By contrast, in the middle trace (orange), the relative residual norm approaches 1, indicating that almost the entirety of the gradient is in the kernel. This run terminated in a gradient-flat region, at effectively an exactly gradient-flat point. Further, the squared gradient norm at 500 iterations, 2e-5, is five orders of magnitude higher than the cutoff, 1e-10. This is smaller than the minimum observed during optimization of this loss (squared gradient norms between 1e-4 and 5e-1), indicating the presence of non-critical gradient-flat regions with very low gradient norm. Critical point-finding methods that disqualify points on the basis of their gradient norm will both converge to and accept these points, even though they need not be near true critical points. In the bottom trace (blue), the behavior of the relative residual norm is the same, while the gradient norm drops much lower, to 3e-13, suggesting convergence to a gradient-flat region around a critical point that has an approximately singular Hessian.
Not all traces exhibit such simple behavior. In many traces, the relative residual norm oscillates between values close to 1 and middling values, indicating that the algorithm is bouncing in and out of one or more gradient-flat regions (see Appendix A.5 for examples, on a classifier). This can occur even when the final target of convergence, given infinite iterations, is a gradient-flat point, as in the example in Sec. 2.2.
We found that 99 of 100 traces included a point where at least half of the gradient was in the kernel, according to our residual measure, while 89% of traces included a point with an even greater residual norm, and 50% included a point with a residual norm near its maximum (Fig. 3C, bottom). This demonstrates that there are many regions of substantive gradient-flatness, in which second-order critical point-finding methods can be substantively slowed.
The original purpose of applying these critical point-finding methods was to determine whether the no-bad-local-minima property held for this loss function and, more broadly, to characterize the relationship at the critical points between the loss and the local curvature, summarized via the Morse index. If we look at either the points found after 500 iterations (results not shown; see Appendix A.5 for an example on a classifier) or the iterates with the highest gradient-flatness (Fig. 3D), we find that the qualitative features of the loss-index relationship reported in [11] and [39] are recreated: convex shape, small spread at low index that increases at higher index, and no minima or near-minima at high values of the loss. However, our analysis suggests that the majority of these points are not critical points but either strict gradient-flat points (orange) or simply points of spurious or incomplete Newton convergence (black). The approximately critical points we do see (blue) have a very different loss-index relationship: their loss is equal to the loss of a network with all parameters set to 0, and their index is low, but not 0.
4 Discussion
The results above and in the appendices demonstrate that gradient-flat regions, where the gradient is nearly in the approximate kernel of the Hessian, are a prevalent feature of some prototypical neural network loss surfaces, and that many critical point-finding methods are attracted to them. The networks used in this paper are very small relative to practical networks for image recognition and natural language processing, which have several orders of magnitude more parameters. However, increasing parameter count tends to increase the singularity of loss Hessians [44], and so we expect there to be even greater gradient-flatness for larger networks.

The strategy of using gradient norm cutoffs to determine whether a point is near enough to a critical point for the loss and index to match the true values is natural, but in the absence of guarantees on the smoothness of the behavior of the Hessian (and its spectrum) around the critical point, the numerical value sufficient to guarantee correctness is unclear. Our observations of gradient-flat regions at extremely low gradient norm, and the separation of these values, in terms of loss-index relationship, from the bulk of the observations, suggest that there may be spurious targets of convergence for critical point-finding methods even at such low gradient norm. Alternatively, these points may in fact be near real critical points, and so indicate that the simple, convex picture of the loss-index relationship painted by the numerical results in [11] and [39] is incomplete. Indeed, recent analytical results have demonstrated that bad local minima do exist for almost all neural network architectures and datasets (see [12] for a helpful table of positive and negative theoretical results regarding local minima). Furthermore, our observation of singular Hessians at low gradient norm suggests that some approximate saddle points of neural network losses may be degenerate (as defined in [23]) and non-strict (as defined in [29]), which indicates that gradient descent may be attracted to these points, according to the analyses in [23] and [29]. These points need not be local minima. However, in two cases we observe the lowest-index saddles at low values of the loss (see Fig. 3, Fig. A.1), and so these analyses still predict that gradient descent will successfully reduce the loss, even if it does not find a local minimum. In the third case, an overparameterized network (Fig. A.2), we do observe a bad local minimum, as predicted in [12] for networks capable of achieving 0 training error.
Our results motivate a revisiting of the numerical results in [11] and [39]. Looking back at Figure 4 of [11], we see that their nonconvex Newton method, a second-order optimization algorithm designed to avoid saddle points by reversing the Newton update along directions of negative curvature, appears to terminate at a gradient norm of order 1. This is only a single order of magnitude lower than what was observed during training. It is likely that this point was either in a gradient-flat region or otherwise had sufficient gradient norm in the Hessian kernel to slow the progress of their algorithm. This observation demonstrates that second-order methods designed for optimization, which use the loss as a merit function rather than norms of the gradient, can also terminate in gradient-flat regions. In this case, the merit function encourages convergence to points where the loss, rather than the gradient norm, is small, but it still cannot guarantee convergence to a critical point. The authors of [11] do not report a gradient norm cutoff, among other details needed to recreate their critical point-finding experiments, so it is unclear to which kind of points they converged. If, however, the norms are as large as those of the targets of their nonconvex Newton method, in accordance with our experience with damped Newton methods and that of [10], then the loss-index relationships reported in their Figure 1 are likely to be for gradient-flat points, rather than critical points.
The authors of [39] do report a squared gradient norm cutoff, of 1e-6. This cutoff sits right in the middle of the bulk of values we observed, which we labeled gradient-flat regions and points of spurious convergence based on the cutoff in [16], which separates a small fraction of runs from this bulk. This suggests that some of their putative critical points were gradient-flat points. Their Figure 6 shows a disagreement between their predictions for the index, based on a loss-weighted mixture of a Wishart and a Wigner random matrix, and their observations. We speculate that some of this gap is due to their method recovering approximate gradient-flat points rather than critical points.
Even in the face of results indicating the existence of bad local minima [12], it remains possible that bad local minima of the loss are avoided by initialization and optimization strategies. For example, ReLU networks suffer from bad local minima when one layer's activations are all zero, or when the biases are initialized at too small of a value [21], but careful initialization and training can avoid the issue. Our results do not directly invalidate this hypothesis, but they do call the supporting numerical evidence into question. Our observation of gradient-flat regions on almost every single run suggests that, while critical points are hard to find and may even be rare, regions where the gradient norm is extremely small are neither. For non-smooth losses, e.g. those of ReLU networks or networks with max-pooling, whose loss gradients can have discontinuities, critical points need not exist, but gradient-flat regions may. Indeed, in some cases, the only differentiable minima in ReLU networks are also flat [27].

Other types of critical point-finding methods are not necessarily attracted to gradient-flat regions, in particular Newton homotopy methods (first used on neural networks in the 90s [10], then revived in the 2010s [3, 35]), which are popular in algebraic geometry [4]. However, singular Hessians still cause issues: for a Hessian with nullity k, the curve to be continued by the homotopy becomes a manifold with dimension 1 + k, and orientation becomes more difficult. This can be avoided by removing the singularity of the Hessian, e.g. by the randomly-weighted regularization method in [34]. However, while these techniques may make it possible to find critical points, they fundamentally alter the loss surface, limiting their utility in drawing conclusions about other features of the loss. In particular, in the time since the initial resurgence of interest in the curvature properties of neural network losses sparked by [11], the importance of overparameterization for optimization of and generalization by neural networks has been identified [30, 40]. Large overparameterized networks have more singular Hessians [44], and so the difference between the original loss and an altered version with an invertible Hessian is greater. Furthermore, the prevalence of gradient-flat regions should be greater, since the Hessian kernel covers an increasingly large subspace.
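The effect of regularizing away a singular Hessian can be seen in a toy calculation. The following sketch is not the method of [34]; it uses a simple ridge term eps*I (our choice) to illustrate why any such modification alters the problem: the resulting Newton step depends on the artificial parameter eps, and blows up as eps goes to 0 when the gradient has a kernel component.

```python
import numpy as np

# Toy illustration (not the method of [34]): a rank-deficient Hessian H
# makes the Newton system H p = -g unsolvable when g has a component in
# ker(H); adding a small ridge eps * I restores invertibility, but the
# resulting step depends strongly on the artificial parameter eps.
H = np.diag([2.0, 1.0, 0.0])          # singular Hessian: nullity 1
g = np.array([1.0, 1.0, 1.0])         # gradient with a component in ker(H)

for eps in (1e-2, 1e-4, 1e-6):
    p = np.linalg.solve(H + eps * np.eye(3), -g)
    print(eps, np.linalg.norm(p))     # step norm grows like 1/eps
```

The step norm here grows without bound as the regularization is removed, which is one way of seeing that conclusions drawn from the regularized surface need not transfer to the original loss.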
The authors of [44] emphasize that when the Hessian is singular everywhere, the notion of a basin of attraction is misleading, since targets of convergence form connected manifolds and some assumptions in theorems guaranteeing first-order convergence become invalid [23], though with sufficient, if unrealistic, overparameterization convergence can be proven [14]. They speculate that a better approach to understanding the behavior of optimizers focuses on their exploration of the sublevel sets of the loss. Our results corroborate that speculation and further indicate that this flatness means using second-order methods to try to accelerate exploration of these regions in search of minimizers is likely to fail: the alignment of the gradient with the Hessian's approximate kernel will tend to produce extremely large steps, for some methods, or no acceleration and even convergence to non-minimizers, for others.
Our observation of ubiquitous gradient-flatness further provides an alternative explanation for the success and popularity of approximate second-order optimizers for neural networks, like K-FAC [32], which uses a layer-wise approximation to the Hessian. These methods are typically motivated by appeals to the computational cost of even Hessian-free exact second-order methods and their brittleness in the stochastic (non-batch) setting. However, exact second-order methods are only justified when the second-order model is good, and at an exact gradient-flat point, the second-order model can be infinitely bad, in a sense, along the direction of the gradient. Approximations need not share this property. Even more extreme approximations, like the diagonal approximations in the adaptive gradient family (e.g. AdaGrad [15], Adam [26]), behave very reasonably in gradient-flat regions: they smoothly scale up the gradient in the directions in which it is small and changing slowly, without making a quadratic model that is optimal in a local sense but poor in a global sense.
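A minimal sketch of the diagonal rescaling just described, using the standard Adam update rule [26] (the specific gradient values are invented for illustration): a coordinate with a tiny, slowly-varying gradient and a coordinate with a large gradient both receive steps of comparable, learning-rate-scale magnitude, with no quadratic model involved.

```python
import numpy as np

# One Adam step, following the update rule of [26].
def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

g = np.array([1e-6, 1e-1])    # one nearly flat direction, one steep one
m, v = np.zeros(2), np.zeros(2)
step, m, v = adam_step(g, m, v, t=1)
print(step)   # both coordinates get steps of roughly lr magnitude
```

The flat direction is scaled up by the square root of its own (tiny) second moment, so the method makes steady progress through a gradient-flat region without ever forming the untrustworthy quadratic model that trips up exact Newton-type methods.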
Overall, our results underscore the difficulty of searching for critical points of singular nonconvex functions, including deep network loss functions, and shed new light on other numerical results in this field. In this setting, second-order methods for finding critical points can fail badly, by converging to gradient-flat points. This failure can be hard to detect unless it is specifically measured. Furthermore, gradient-flat points are generally places where quadratic approximations become untrustworthy, and so our observations are of relevance for the design of exact and approximate second-order optimization methods as well.
Acknowledgements
The authors would like to thank Yasaman Bahri, Jesse Livezey, Dhagash Mehta, Dylan Paiton, and Ryan Zarcone for useful discussions. Authors CF & AL were supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. NW was supported by the Google PhD Fellowship. Author AL was supported by a National Institutes of Health training grant, 5T32NS095939. MRD was supported in part by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract W911NF-13-1-0390. KB was funded by a DOE/LBNL LDRD, 'Deep Learning for Science' (PI, Prabhat).
References
[1] (2000) Saddles in the energy landscape probed by supercooled liquids. Physical Review Letters 85 (25), pp. 5356–5359.
[2] (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2 (1), pp. 53–58.
[3] (2017) Energy landscapes for machine learning. Phys. Chem. Chem. Phys. 19, pp. 12585–12603.
[4] (2013) Numerically solving polynomial systems with Bertini (Software, Environments and Tools). Society for Industrial and Applied Mathematics. ISBN 1611972698.
[5] (2004) Convex optimization. Cambridge University Press, New York, NY, USA. ISBN 0521833787.
[6] (2000) Energy landscape of a Lennard-Jones liquid: statistics of stationary points. Physical Review Letters 85 (25), pp. 5360–5363.
[7] (2004) On the convergence of Newton iterations to non-stationary points. Mathematical Programming 99 (1), pp. 127–148.
[8] (1981) On finding transition states. The Journal of Chemical Physics 75 (6), pp. 2800–2806.
[9] (2011) MINRES-QLP: a Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing 33 (4), pp. 1810–1836.
[10] (1997) 488 solutions to the XOR problem. In Advances in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), pp. 410–416.
[11] (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR abs/1406.2572.
[12] (2019) Suboptimal local minima exist for almost all over-parameterized neural networks. arXiv:1911.01413.
[13] (2002) Saddle points and dynamics of Lennard-Jones clusters, solids, and supercooled liquids. The Journal of Chemical Physics 116 (9), pp. 3777–3788.
[14] (2019) Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[15] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
[16] (2019) Numerically recovering the critical points of a deep linear autoencoder. arXiv:1901.10603.
[17] (2018) Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv:1802.10026.
[18] (2019) An investigation into neural net optimization via Hessian eigenvalue density. arXiv:1901.10159.
[19] (2014) Qualitatively characterizing neural network optimization problems. CoRR abs/1412.6544.
[20] (1983) Analysis of Newton's method at irregular singularities. SIAM Journal on Numerical Analysis 20 (4), pp. 747–773.
[21] (2020) Training two-layer ReLU networks with gradient descent is inconsistent. arXiv:2002.04861.
[22] (2014) Newton-type methods for optimization and variational problems. Springer International Publishing.
[23] (2017) How to escape saddle points efficiently. CoRR abs/1703.00887.
[24] (2018) Minimizing nonconvex population risk from rough empirical risk. CoRR abs/1803.09357.
[25] (2018) Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory, Proceedings of Machine Learning Research, Vol. 75, pp. 1042–1085.
[26] (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
[27] (2017) The multilinear structure of ReLU networks. arXiv:1712.10132.
[28] (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[29] (2016) Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 49, pp. 1246–1257.
[30] (2018) On the benefit of width for neural networks: disappearance of bad basins. arXiv:1812.11039.
[31] (2016) Modeling, inference and optimization with composable differentiable procedures. Ph.D. Thesis, Harvard University.
[32] (2015) Optimizing neural networks with Kronecker-factored approximate curvature. arXiv:1503.05671.
[33] (1972) Structure of transition states in organic reactions. General theory and an application to the cyclobutene-butadiene isomerization using a semiempirical molecular orbital method. Journal of the American Chemical Society 94 (8), pp. 2625–2633.
[34] (2018) The loss surface of deep linear networks viewed through the algebraic geometry lens. arXiv:1810.07716.
[35] (2018) Loss surface of XOR artificial neural networks. Physical Review E 97 (5).
[36] (2006) Numerical optimization. 2nd edition, Springer Series in Operations Research and Financial Engineering, Springer, New York, NY. ISBN 9780387303031.
[37] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
[38] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[39] (2017) Geometry of neural network loss surfaces via random matrix theory. In International Conference on Learning Representations (ICLR).
[40] (2020) Complexity control by gradient descent in deep networks. Nature Communications 11 (1).
[41] (1970) A hybrid method for nonlinear equations. In Numerical Methods for Nonlinear Algebraic Equations.
[42] (2017) Searching for activation functions. arXiv:1710.05941.
[43] (2018) Newton-MR: Newton's method without smoothness or convexity. arXiv:1810.00303.
[44] (2017) Empirical analysis of the Hessian of over-parametrized neural networks. CoRR abs/1706.04454.
[45] (1993) The fundamental theorem of linear algebra. The American Mathematical Monthly 100 (9), p. 848.
[46] (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957.
[47] (2016) Understanding deep learning requires rethinking generalization. CoRR abs/1611.03530.
Appendix A Appendices
A.1 Networks and Datasets
A.1.1 Datasets
For the experiments in Fig. 1, 10000 16-dimensional Gaussian vectors with mean 0 and diagonal covariance with linearly-spaced values between 1 and 16 were generated and then mean-centered.
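A minimal sketch of this dataset generation, reading the linearly spaced covariance values as variances (the text does not say whether they are variances or standard deviations, and the random seed is our choice):

```python
import numpy as np

# 10000 16-dimensional Gaussian vectors, mean 0, diagonal covariance with
# entries linearly spaced from 1 to 16, then mean-centered.
rng = np.random.default_rng(0)               # seed is our choice
variances = np.linspace(1.0, 16.0, 16)
X = rng.normal(0.0, np.sqrt(variances), size=(10000, 16))
X = X - X.mean(axis=0, keepdims=True)        # mean-center

print(X.shape)    # (10000, 16)
```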
For the experiments in Fig. 3 and Fig. A.1, 10000 images from the MNIST dataset [28] were cropped to 20 x 20 and rescaled to 4 x 4 using PyTorch [37], then z-scored. This was done for two reasons: first, to improve the conditioning of the data covariance, which is very poor for MNIST due to low variance in the border pixels; second, to reduce the number of parameters in the network, since computing a high-quality inexact Newton solution becomes more expensive as the parameter count grows. Nonlinear classification networks trained on this downsampled data could still obtain accuracies above 90%, better than the performance of logistic regression (87%).

For the experiments in Fig. A.2, 50 random images of 0s and 1s from the MNIST dataset were PCA-downsampled to 32 dimensions using sklearn [38]. This provided an alternative approach to improving the conditioning of the data covariance and reducing the parameter counts in the network. The labels for these images were then shuffled.
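The PCA-downsampling and label-shuffling step can be sketched as follows. We use a random stand-in for the 50 MNIST images rather than downloading the dataset, and implement the PCA projection directly with an SVD (the paper used sklearn's PCA, which is equivalent up to signs):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((50, 784))          # placeholder for flattened digits
labels = np.repeat([0, 1], 25)          # 25 zeros and 25 ones

# PCA-downsample to 32 dimensions: center, then project onto the top
# 32 right singular vectors of the centered data matrix.
centered = images - images.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
X = centered @ Vt[:32].T

shuffled_labels = rng.permutation(labels)   # shuffle the labels

print(X.shape)    # (50, 32)
```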
A.1.2 Networks
All networks, their optimization algorithms, and the critical point-finding algorithms were defined in the autograd Python package [31]. For the experiments in Fig. 1, two networks were trained: a linear autoencoder with a single, fully-connected hidden layer of 4 units and a deep nonlinear autoencoder with two fully-connected hidden layers of 16 and 4 units with Swish [42] activations. Performance of the critical point-finding algorithms was even worse for networks with rectified linear units (results not shown), as reported by others (Pennington and Bahri, personal communication). Non-smooth losses need not have gradients that smoothly approach 0 near local minimizers, so it is only sensible to apply critical point-finding to smooth losses; see [27]. The nonlinear autoencoder used L2 regularization. Neither network had biases. All autoencoding networks were trained with mean squared error.

For the experiments in Fig. 3, a fully-connected autoencoder with two hidden layers of 8 and 16 units, with Swish activations and biases, was used. This network had no regularization.
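A minimal numpy sketch of the Fig. 3 network: a fully-connected autoencoder with hidden layers of 8 and 16 units, Swish activations, biases, and mean squared error. The layer sizes and loss follow the text; the initialization scheme and the 16-dimensional input (the downsampled MNIST of Appendix A.1.1) are our assumptions, and the paper defines its networks in autograd rather than raw numpy.

```python
import numpy as np

def swish(x):
    # Swish activation [42]: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def init(sizes, rng):
    # one (weights, biases) pair per layer; 1/sqrt(fan-in) scaling is our choice
    return [(rng.normal(0, 1 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = swish(x @ W + b)
    W, b = params[-1]
    return x @ W + b                   # linear output layer

rng = np.random.default_rng(0)
sizes = [16, 8, 16, 16]                # input, hidden 8, hidden 16, output
params = init(sizes, rng)
X = rng.normal(size=(32, 16))          # a batch of 16-dimensional inputs
loss = np.mean((forward(params, X) - X) ** 2)   # MSE reconstruction loss
print(loss)
```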
For the experiments in Fig. A.1, a fully-connected classifier with two hidden layers of 12 and 8 units, with Swish activations and biases, was used. This network had L2 regularization, since the cross-entropy loss with which it was trained can otherwise have critical points at infinity.
For the experiments in Fig. A.2, a fully-connected classifier with two hidden layers of 32 and 4 units, with Swish activations, was used. This network had no biases. It also used L2 regularization and was trained with the cross-entropy loss. Networks were trained to near-perfect training performance: 48 to 50 correctly-classified examples out of 50.
A.2 Critical Point-Finding Experiments
For all critical point-finding experiments, we followed the basic procedure pioneered in [11] and used in [39] and [16]. First, an optimization algorithm was used to train the network multiple times. For the results in Fig. 1, this algorithm was full-batch gradient descent, while for the remainder of the results, this algorithm was full-batch gradient descent with momentum (learning rates 0.1 in both cases; momentum 0.9 in the latter).
The parameter values produced during optimization were then used as starting positions for Newton-MR. Following [39] and based on the arguments in [16], we selected these initial points uniformly at random with respect to their loss value.
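One plausible reading of this selection rule can be sketched as follows (the function name and the nearest-checkpoint matching are our illustration, not the paper's exact procedure): draw target loss values uniformly between the minimum and maximum observed losses, then pick the saved checkpoint whose loss is closest to each target.

```python
import numpy as np

def select_uniform_in_loss(losses, n_starts, rng):
    """Pick checkpoint indices whose losses are ~uniform in loss value."""
    targets = rng.uniform(losses.min(), losses.max(), size=n_starts)
    # for each target loss, take the checkpoint whose loss is closest
    return np.abs(losses[None, :] - targets[:, None]).argmin(axis=1)

# hypothetical training-loss curve: 200 checkpoints, loss decaying to 1e-4
losses = np.geomspace(1.0, 1e-4, 200)
idx = select_uniform_in_loss(losses, 10, np.random.default_rng(0))
print(sorted(idx))
```

Sampling uniformly in loss, rather than uniformly in training time, avoids over-representing the late, low-loss portion of the trajectory where checkpoints cluster.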
Newton-MR [43] computes an inexact Newton update with the MINRES-QLP solver [9] and then performs backtracking line search based on the squared gradient norm. For pseudocode of the equivalent exact algorithm, see Appendix A.3.
The MINRES-QLP solver has the following hyperparameters for determining stopping behavior: maximum number of iterations (maxit), maximum solution norm (maxxnorm), relative residual tolerance (rtol), and condition number limit (acondlim). We set maxit equal to the number of parameters, since with exact arithmetic this is sufficient to solve the Newton system. We found that the maxxnorm and acondlim parameters did not affect stopping behavior, which was driven by the tolerance rtol. For Fig. 1, we used a tolerance of 1e-10. For Fig. 3 and Fig. A.1, we used a tolerance of 5e-4, based on the values of the relative residuals found after maxit iterations on test points. See Appendix A.4 for details about the relative residual stopping criterion. We do not provide pseudocode for this algorithm; see [9].

The backtracking line search has the following hyperparameters: the starting step size, the multiplicative factor by which the step size is reduced, and the degree of improvement required to terminate the line search. We fixed these hyperparameters across all experiments. Furthermore, in backtracking line search for Newton methods, it is important to always check a unit step size in order to attain superlinear convergence [36]. So before running the line search, we also check the unit step size with a stricter improvement requirement.
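The line search just described can be sketched as follows. The merit function is the squared gradient norm; the hyperparameter names and default values (alpha0, beta, c, c_unit) are ours, standing in for the paper's symbols, which were lost in extraction.

```python
import numpy as np

def backtracking(merit, x, p, dderiv, alpha0=1.0, beta=0.5,
                 c=1e-4, c_unit=0.9, max_iter=50):
    """Armijo backtracking on a merit function (squared gradient norm).

    dderiv: directional derivative of the merit function along p (< 0).
    """
    f0 = merit(x)
    # check the full Newton step first, with a stricter acceptance
    # constant, to preserve superlinear convergence near a solution
    if merit(x + p) <= f0 + c_unit * dderiv:
        return 1.0
    alpha = alpha0
    for _ in range(max_iter):          # standard Armijo condition
        if merit(x + alpha * p) <= f0 + c * alpha * dderiv:
            break
        alpha *= beta
    return alpha

# toy usage: gradient g(x) = x, merit ||g||^2, Newton step p = -x
x = np.array([1.0, 0.0])
print(backtracking(lambda z: float(z @ z), x, -x, dderiv=-2.0))
```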
A.3 Newton-MR Pseudocode
The pseudocode in Algorithm 1 defines an exact least-squares Newton method with exact line search. To obtain the inexact Newton-MR algorithm, the least-squares problem that determines the update direction should be solved approximately using the MINRES-QLP solver [9], and the exact line search that determines the step size should be replaced by Armijo-type backtracking line search. For details, see [43].
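Since Algorithm 1 itself did not survive extraction, here is a runnable sketch of the exact algorithm as described: at each step, solve the Newton system in the least-squares sense (here via numpy's minimum-norm least-squares solver), then choose the step size minimizing the squared gradient norm, with a grid search standing in for exact line search.

```python
import numpy as np

def least_squares_newton(grad, hess, x, n_steps=100):
    """Exact least-squares Newton with (approximate) exact line search."""
    for _ in range(n_steps):
        g, H = grad(x), hess(x)
        # minimum-norm least-squares solution of the Newton system H p = -g
        p = np.linalg.lstsq(H, -g, rcond=None)[0]
        # grid search over step sizes, standing in for exact line search
        alphas = np.linspace(0.0, 2.0, 201)
        merits = [np.sum(grad(x + a * p) ** 2) for a in alphas]
        x = x + alphas[int(np.argmin(merits))] * p
    return x

# usage on a toy quartic with a known critical point at the origin
grad = lambda x: x ** 3
hess = lambda x: np.diag(3 * x ** 2)
x_star = least_squares_newton(grad, hess, np.array([1.0, -0.5]))
print(np.sum(grad(x_star) ** 2))   # squared gradient norm near zero
```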
A.4 Relative Residual and Relative Co-Kernel Residual
The relative residual norm r1 measures the size of the error of an approximate solution to the Newton system. Introducing the symbols H and g for the Hessian and gradient at a query point, the Newton system may be written Hp = -g, and the relative residual of an approximate solution p is then

r1 = ||Hp + g|| / (||H||_F ||p|| + ||g||)

where ||H||_F is the Frobenius norm of H. Since all quantities are nonnegative, r1 is nonnegative; because the denominator bounds the numerator, by the triangle inequality and the compatibility of the Frobenius and Euclidean norms, r1 is at most 1. For an exact solution of the Newton system, r1 is 0, the minimum value, while for the zero vector p = 0 it is 1, the maximum value. Note that small values of ||g|| do not imply large values of this quantity, since the numerator ||Hp + g|| also goes to 0 when a Newton method converges towards a critical point, as ||g|| goes to 0.
When g is partially in the kernel of H, the Newton system is unsatisfiable, as -g will also be partly in the co-image of H, the linear subspace into which H cannot map any vector. In this case, the minimal value for r1 will no longer be 0. The optimal solution instead has the property that its residual is 0 once restricted to the co-kernel of H, the linear subspace orthogonal to the kernel of H. This co-kernel residual can be measured by applying the matrix H to the residual vector Hp + g. After normalization, it becomes

r2 = ||H(Hp + g)|| / (||H||_F ||Hp + g||)
Note that this value is also small when the residual lies primarily along the eigenvectors of H with eigenvalues of smallest magnitude. On each internal iteration, MINRES-QLP checks whether either of these values is below a tolerance level (in our experiments, 5e-4) and, if either is, it ceases iteration. With exact arithmetic, one or the other of these values should go to 0 within a finite number of iterations; with inexact arithmetic, they should just become small. See [9] for details. Less than 5% of Newton steps were obtained from the plain relative residual going below the tolerance, with the remainder stopping on the co-kernel residual, indicating that almost all points of the loss surface had an approximately unsatisfiable Newton system.
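The two normalized residuals above can be computed directly (the function name is ours; this reconstructs the formulas in the notation Hp = -g, with residual r = Hp + g, not MINRES-QLP's internal implementation):

```python
import numpy as np

def relative_residuals(H, p, g):
    """Relative residual r1 and relative co-kernel residual r2."""
    r = H @ p + g                       # residual of the Newton system
    nH = np.linalg.norm(H, 'fro')
    r1 = np.linalg.norm(r) / (nH * np.linalg.norm(p) + np.linalg.norm(g))
    r2 = np.linalg.norm(H @ r) / (nH * np.linalg.norm(r))
    return r1, r2

# when g has a kernel component, the least-squares solution drives the
# co-kernel residual r2 to zero even though r1 cannot reach zero
H = np.diag([1.0, 0.0])                # singular Hessian
g = np.array([1.0, 1.0])               # gradient partly in ker(H)
p = np.linalg.lstsq(H, -g, rcond=None)[0]
r1, r2 = relative_residuals(H, p, g)
print(r1, r2)
```

(The r2 expression is undefined when the residual is exactly zero; in that case the r1 criterion has already fired.)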
A.5 Replication of Results from Sec. 3 on MNIST MLP
We repeated the experiments whose results are visualized in Fig. 3 on the loss surface of a fully-connected classifier on the same modified MNIST dataset (details in Appendix A.1). We again found that the performance of the Newton-MR critical point-finding algorithm was poor (Fig. A.1A) and that around 90% of runs encountered a point with gradient-flatness above our cutoff (Fig. A.1C, bottom row). However, we observed that fewer runs terminated at a gradient-flat point (Fig. A.1C, top row), perhaps because the algorithm was bouncing in and out of gradient-flat regions (Fig. A.1B, top and bottom rows), rather than because of another type of spurious Newton convergence. If we measure the loss-index relationship at these non-gradient-flat points (Fig. A.1D), we see the same pattern as in Fig. 3. This also holds if we look at the maximally flat points, as in Fig. 3D (results not shown).
A.6 Replication of Results from Sec. 3 on Binary MNIST Subset Memorization
We repeated the experiments whose results are visualized in Fig. 3 on the loss surface of a fully-connected classifier on a small subset of 50 0s and 1s from the MNIST dataset (details in Appendix A.1). In this setting, the network is overparameterized, in that it has a hidden layer almost as wide as the dataset (32 units vs 50 points) and has more parameters than there are points in the dataset (1160 vs 50). It is also capable of achieving 100% accuracy on this training set, which has random labels, as in [47]. We again observe that the majority of runs do not terminate with squared gradient norm under 1e-8 (33 out of 50 runs) and that a similar fraction (31 out of 50 runs) encounter gradient-flat points (Fig. A.2A and C, bottom panel). The loss-index relationship looks qualitatively different, as might be expected for a task with random labels. Notice the appearance of a bad local minimum: the blue point at index 0 with a loss well above the minimum.