Neural networks have pushed forward the state of the art in a variety of machine learning tasks (Schmidhuber, 2014; LeCun et al., 2015), but little is known about precisely how they work or why they can be effectively optimized to solutions with good test set performance. In particular, it is not known why the non-convex training problem of a neural network is soluble using popular local methods such as stochastic gradient descent. One approach to answering this question involves direct numerical interrogation of the loss surfaces of neural networks. Efforts in this vein (Pascanu et al., 2014; Dauphin et al., 2014; Choromanska et al., 2014; Pennington & Bahri, 2017) have demonstrated cases in which these loss surfaces enjoy a favorable “no bad local minima” property (that is, all minima are located at low values of the loss) and instead contain a proliferation of saddle points, which were believed to be problematic for first-order methods. Later theoretical work in non-convex optimization then showed that some stochastic first-order methods avoid saddle points of the training loss (Lee et al., 2016; Jin et al., 2017) and converge to minima that generalize to the test loss (Jin et al., 2018). Taken together, these claims would do much to explain the trainability of neural networks.
However, a thorough understanding of the trainability of neural networks remains to be achieved. In particular, much is unknown regarding the properties of the loss surface, on which we focus in this paper. The theoretical results cited above concerning the shape of the loss surface are asymptotic and apply to simplified, approximate models (Choromanska et al., 2014; Pennington & Bahri, 2017). The rates at which these asymptotic results apply and the degree to which these approximations break down as properties of the data and architecture are varied are unknown. Furthermore, recent theoretical results (Liang et al., 2018) have indicated that the loss surfaces for some common neural network architectures, e.g. those with ReLUs and without skip connections, suffer from poor local minima even for linearly separable data. Lastly, there is debate in the field on whether minima with different generalization performance exist and are characterized by their curvature properties (Hochreiter & Schmidhuber, 1997; Dinh et al., 2017; Yao et al., 2018). Taken together, all these results underscore the need for further study of the shape of the loss surface.
One way to determine to what extent and why neural network loss surfaces can suffer from bad local minima, and what effect the distribution of minima, maxima, and saddle points (generically, points with zero gradient norm, or critical points) has on optimization and generalization, is to perform a thorough empirical examination of the interplay between architecture, data, and initialization strategy as they affect the loss surface. In order to do so, robust numerical methods for finding critical points are needed.
Unfortunately, the problem of finding all of the critical points of a neural network loss surface is non-trivial. First, the rapid scaling of the number of critical points with input dimension (Baldi & Hornik, 1989; Auer et al., 1996), ignoring continuous equivalences (Freeman & Bruna, 2016), is such that finding all of them can be impossible for any but the smallest of architectures. This necessitates a sampling-based method, but the potential biases of such sampling methods are unknown and hard to quantify in the absence of ground truth. Second, numerically finding points with gradient exactly equal to zero is unlikely with finite precision, necessitating the setting of a tolerance for the norm of the gradient. This tolerance can be analytically determined for functions with Lipschitz conditions on the operator norm of the Hessian and on the norm of the gradient, but these values are poorly controlled for common neural networks (Gouk et al., 2018) and worst-case upper bounds might be overly pessimistic in the typical case. In the absence of ground truth, it is impossible to evaluate the efficacy of critical point-finding algorithms in solving these two problems. Here we make progress in this direction by conducting a thorough study of three critical point-finding algorithms on a model where ground truth is available, thereby clarifying algorithmic choices that may yield accurate results in more complicated cases.
The problem of finding the critical points of neural network loss surfaces is actually a specific instance of an old, general problem in disguise: the problem of finding the zeros, or roots, of a function. Here, that function is the gradient field of the loss surface. In multiple other areas, e.g. chemical physics (Wales, 2004; Ballard et al., 2017), numerical algebraic geometry (Sommese et al., 2005), and economics (Kehoe, 1987), work extending back decades has identified algorithms for this task with varying convergence properties, domains of application, caveats, and scalabilities.
In Sec. 2.2, we review methods used in previous papers that found the critical points of neural network loss surfaces in the context of this literature. We also introduce a new algorithm, originally invented for another purpose, as a critical point-finding algorithm for neural networks. We then apply these methods to a neural network loss surface where the ground truth identities of the critical points are known: that of a deep linear autoencoder (Baldi & Hornik, 1989), which minimizes the same loss over the same class of functions as does Principal Components Analysis (PCA), but in a different parameterization. Though linear networks do not have the same representational capacity as nonlinear networks, their training exhibits many of the properties of nonlinear network training (Saxe et al., 2013), and they provide a reasonably close test problem for algorithms of unknown efficacy.
We find that, while all the algorithms we study are capable of finding critical points (Fig. 1), strict cutoffs are necessary to ensure accuracy (Fig. 2), and all methods provide biased samples of the set of critical points (Figs. 3 & 4). We identify algorithmic choices that can improve performance and reduce this bias.
2 Methods for Sampling Critical Points
The problem of sampling the critical points of a loss surface requires two pieces to solve: first, an algorithm capable of finding a single critical point; second, a method for initializing this algorithm repeatedly in such a way that its outputs are a representative, ideally unbiased, sample of the true critical points.
The loss surface is a scalar function $L(\theta)$ of the parameters $\theta \in \mathbb{R}^d$ of the neural network. We wish to understand, by numerical means, the distribution of points where the gradient of the loss is zero. Formally, we define these, the critical points (CPs) of $L$, as the set $\mathcal{C} = \{\theta : \nabla L(\theta) = 0\}$.
Each CP can be classified by means of its (fractional) index $\alpha$, defined as the fractional number of negative eigenvalues of the Hessian at that point: $\alpha = \frac{1}{d}\,\#\{i : \lambda_i(\nabla^2 L) < 0\}$. Note that $\alpha = 1$ (the Hessian is negative semidefinite) indicates a potential local maximum, $0 < \alpha < 1$ always indicates a saddle point, and $\alpha = 0$ (the Hessian is positive semidefinite) indicates a potential local minimum.
In practice, it is not possible to numerically locate points where $\nabla L$ is identically zero. Hence we introduce a relaxed definition of a CP, an $\epsilon$-CP, as an element of the set $\mathcal{C}_\epsilon = \{\theta : \|\nabla L(\theta)\|^2 \le \epsilon\}$. We will discuss in detail the effects of different choices of $\epsilon$ on the results of running critical point-finding algorithms, shown in Fig. 2.
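As a concrete illustration (not from the original paper; the function name and tolerance below are ours), the two definitions above can be checked numerically for a candidate point given its gradient and Hessian:

```python
import numpy as np

def classify_point(grad, hess, eps=1e-10):
    """Classify a candidate point from its gradient and Hessian.

    Returns (is_eps_cp, alpha): whether ||grad||^2 <= eps, and the
    fractional index alpha = (# negative Hessian eigenvalues) / dimension."""
    is_eps_cp = float(grad @ grad) <= eps
    eigs = np.linalg.eigvalsh(hess)          # eigenvalues of the symmetric Hessian
    alpha = float(np.mean(eigs < 0))         # fraction that are negative
    return is_eps_cp, alpha

# saddle of L(x, y) = x^2 - y^2 at the origin: zero gradient, Hessian diag(2, -2)
print(classify_point(np.zeros(2), np.diag([2.0, -2.0])))   # (True, 0.5)
```

A point with $\alpha = 0.5$ and a tiny squared gradient norm, as here, would be recorded as a saddle-type $\epsilon$-CP.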
In the context of neural network loss surfaces, an often-studied property of $\mathcal{C}$ is the relationship of the indices of its members to their heights on the surface. (Dauphin et al., 2014) and (Pennington & Bahri, 2017) proposed models of $\mathcal{C}$ and numerically found subsets of $\mathcal{C}_\epsilon$ with the same loss-index relationship. In this paper, we identify algorithmic choices that lead to recovery of a subset of $\mathcal{C}_\epsilon$ with approximately the same loss-index relationship as $\mathcal{C}$ in a case where $\mathcal{C}$ is known.
2.2 Finding a Single Critical Point
2.2.1 Gradient Norm Minimization
Given the problem of finding points that approximately satisfy a certain criterion, the natural optimization approach is to convert that criterion into a differentiable loss function and minimize it by local methods. For the problem of finding points with small gradient norm, the result is an auxiliary loss function

$$g(\theta) = \frac{1}{2}\|\nabla L(\theta)\|^2. \quad (1)$$
An instance of this class of methods was independently proposed for the problem of critical point-finding in neural networks by (Pennington & Bahri, 2017), but such methods in fact have a long history in chemical physics, dating back to the 1970s under the name “gradient norm minimization” (GNM) (McIver & Komornicki, 1972), and they were simultaneously and independently rediscovered thirty years later (Angelani et al., 2000; Broderix et al., 2000).
There are two major concerns with this class of methods. First, the problem is approximately quadratically worse-conditioned than the original problem (McIver & Komornicki, 1972), and neural network losses are already poorly-conditioned (Sagun et al., 2017), resulting in very slow convergence for first-order methods. Second, the loss surface of GNM, ironically, can suffer from a bad local minima property of its own, which is to say that it contains local minima that are not true critical points of the original loss surface. These arise anywhere that the gradient lies in the nullspace of the Hessian. It has been shown in the chemical physics literature that on some surfaces these spurious local minima dominate the global minima (Doye & Wales, 2002).
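A minimal sketch of GNM under stated assumptions: `grad_fn` is a user-supplied gradient oracle, and a central finite-difference Hessian-vector product stands in for the fast autodiff product used in practice. The gradient of $g$ is $\nabla^2 L(\theta)\,\nabla L(\theta)$, so each step needs only one Hessian-vector multiply:

```python
import numpy as np

def gnm_step(grad_fn, theta, lr=0.01, fd_eps=1e-6):
    """One gradient descent step on g(theta) = 0.5 * ||grad L(theta)||^2.

    grad g = H(theta) @ grad L(theta); the Hessian-vector product is
    approximated here by central finite differences."""
    g = grad_fn(theta)
    hvp = (grad_fn(theta + fd_eps * g) - grad_fn(theta - fd_eps * g)) / (2 * fd_eps)
    return theta - lr * hvp

# toy surface: L(theta) = 0.5 * theta^T A theta with indefinite A, so the
# critical point at zero is a saddle that plain gradient descent would flee
A = np.diag([2.0, -1.0])
grad_fn = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
for _ in range(2000):
    theta = gnm_step(grad_fn, theta)
print(np.linalg.norm(grad_fn(theta)) < 1e-3)   # True: GNM converged to the saddle
```

Note the slow convergence even on this well-behaved two-dimensional quadratic; the effective conditioning is that of $A^2$, illustrating the squared conditioning penalty discussed above.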
2.2.2 Newton Methods
Other approaches to finding points with zero gradient norm are better understood and have better convergence properties. In particular, the zeros of the gradient field are solutions to a nonlinear system of equations, $\nabla L(\theta) = 0$, and can be found using root-finding techniques. The classic root-finding algorithm is Newton-Raphson (Izmailov & Solodov, 2014), which computes an update $p_k$ by solving the following linear system of equations:

$$\nabla^2 L(\theta_k)\, p_k = -\nabla L(\theta_k). \quad (2)$$
Though this algorithm enjoys rapid local convergence (Boyd & Vandenberghe, 2004), it has no global convergence guarantees on non-smooth functions, and the radius of local convergence can be zero if the Hessian is singular at the solution (Griewank & Osborne, 1983). Indeed, early work on finding critical points of neural network loss surfaces found that Newton-Raphson (with a fixed, non-unit step size) often failed to converge (Coetzee & Stonick, 1997). Newton-Raphson is therefore typically augmented with additional machinery to guarantee global convergence on a wider class of functions. We consider two options in this paper.
The first, which we call Newton-TR, follows (Dauphin et al., 2014) and uses a Levenberg (Levenberg, 1944) scheme, equivalent to a trust region approach. Instead of solving Eq. 2, the modified equation

$$\left(\nabla^2 L(\theta_k) + \lambda I\right) p_k = -\nabla L(\theta_k),$$

where $I$ denotes the identity matrix, is solved for multiple fixed values of $\lambda$. In an optimization context, the update that results in the lowest value of $L$ is used. In the context of root-finding, we instead take the update that results in the lowest value of $\|\nabla L\|$.
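A sketch of one such damped step (names and damping grid ours; a dense `np.linalg.solve` stands in for the MR-QLP solver used in the experiments):

```python
import numpy as np

def newton_tr_step(grad_fn, hess_fn, theta, lambdas=(1e-4, 1e-2, 1.0)):
    """Levenberg-damped Newton step for root-finding on grad L = 0.

    Solves (H + lam * I) p = -g for each fixed damping value lam and keeps
    the candidate point with the smallest resulting gradient norm."""
    g, H = grad_fn(theta), hess_fn(theta)
    best_theta, best_norm = theta, np.linalg.norm(g)
    for lam in lambdas:
        p = np.linalg.solve(H + lam * np.eye(theta.size), -g)
        candidate = theta + p
        cand_norm = np.linalg.norm(grad_fn(candidate))
        if cand_norm < best_norm:
            best_theta, best_norm = candidate, cand_norm
    return best_theta

# on a quadratic surface with an indefinite Hessian, a few damped steps
# drive the gradient norm toward zero even though the root is a saddle
A = np.diag([3.0, -2.0])
theta = np.array([1.0, -1.0])
for _ in range(20):
    theta = newton_tr_step(lambda t: A @ t, lambda t: A, theta)
print(np.linalg.norm(A @ theta))
```

The printed residual is at numerical zero after a handful of steps, in contrast to the thousands of steps the GNM sketch above needs on a comparable quadratic.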
The second is based on the recently proposed Newton-MR method (Roosta et al., 2018), for “minimum residual”. As above, Eq. 2 is solved for $p_k$, which is then used as a line search direction with a novel set of conditions, derived by applying the Wolfe conditions to the squared gradient norm (see (Roosta et al., 2018) for details). This method was proposed as an optimizer for functions with the property that the gradient is only zero at minimizers, and it is proven to converge to points with low gradient norm. This makes it an appealing candidate for root-finding, unlike other Newton methods that are designed for more restricted classes of functions.
2.2.3 Other Methods
Here we review other methods for finding critical points of neural network loss surfaces, which provide promising targets for future research.
Eigenvector-following methods (Trygubenko & Wales, 2004) are commonly used to find critical points of a desired index in chemical physics. This is accomplished by initializing a quasi-Newton method, such as L-BFGS (Liu & Nocedal, 1989), at a local minimum and reversing the sign of the updates of that algorithm in a fixed set of directions at every step, corresponding to the desired index of the saddle being searched for. These methods are primarily used to find low-index saddles, rather than all critical points, and require prior knowledge of a local minimum. Eigenvector-following methods were applied to neural networks in (Ballard et al., 2017) and (Mehta et al., 2018a).
Homotopy methods comprise another class of approaches for root-finding, most prominently in numerical algebraic geometry (Sommese et al., 2005), where the polynomial form of the nonlinearity can be exploited. Homotopy methods use continuous transformations to deform solutions of a problem whose answers are known by construction into the solutions of a problem of interest (Davidenko, 1953; Broyden, 1969). They were first applied to neural network loss surfaces in (Coetzee & Stonick, 1997) and again more recently to linear network losses by (Mehta et al., 2018b). The latter took advantage of the polynomial structure of squared losses applied to linear networks to use advanced methods from numerical algebraic geometry (Bates et al., 2013). Outside of the case of polynomials, convergence guarantees are harder to come by.
2.3 Sampling Multiple Critical Points
Given an algorithm that can find a single critical point, the next problem is to define a method for initializing this algorithm repeatedly in such a way that the outputs form a representative sample of the true critical points. This presumes that the goal of examining the loss surface is to determine its mathematical properties, rather than the properties of, e.g., the parts of the loss surface which typical optimizers encounter from typical initializations. Two heuristics have been used in previous work. In both methods, the iterates of an optimization algorithm applied to the loss surface are used to propose points. In the first, from (Dauphin et al., 2014), these points are sampled uniformly from the iterates. We term this method “uniform iteration”. In the second, from (Pennington & Bahri, 2017), iterates are sampled uniformly according to their height on the loss surface. We term this method “uniform height”. As the identities of the critical points sampled by such a method are determined by the combination of loss surface shape, optimization algorithm behavior, and critical point-finding algorithm behavior, there is no guarantee of even sampling. In an attempt to reduce any possible sampling bias, (Dauphin et al., 2014) perturbed sampled points with additive noise. We compare the sampling bias of both initialization methods in their noisy and noiseless versions below (see Figs. 3 & 4).
3 The Deep Linear Autoencoder: A Model with Ground Truth
Linear deep networks exhibit many of the training (Saxe et al., 2013) and generalization (Advani & Saxe, 2017) complexities of their nonlinear counterparts while simultaneously being more amenable to analysis. The loss surface of a feedforward single-hidden layer linear network with squared error loss was first studied in (Baldi & Hornik, 1989), where the authors demonstrated that there are no non-global minima. Recent work in linear networks has extended that result to multi-hidden layer networks (Zhou & Liang, 2017) and to arbitrary differentiable convex losses (Laurent & von Brecht, 2017). Unlike in nonlinear networks, the identity of every critical point is known in the single-hidden layer, linear case. This provides a ground truth against which numerical results can be compared. Here we study the performance of critical point finding algorithms on a single-hidden layer deep linear autoencoder (DLAE) with sixteen input and output units and four hidden units, applied to Gaussian data with a diagonal covariance matrix and evenly spaced eigenvalues between 1 and 16. What follows is a description of the set of critical points of the loss surface of this network, after (Baldi & Hornik, 1989).
Each critical point of a DLAE with $n$ input units and $p$ hidden units ($p \le n$) corresponds to a projection of the data onto a space spanned by some combination of at most $p$ of the eigenvectors of the data covariance matrix. In the network parameterization, this projection matrix (of rank at most $p$) is factored into the $p \times n$ input and $n \times p$ output weight matrices. This factorization is only unique up to an invertible linear transformation, and so each critical point is part of a disconnected Lie group of critical points. (This situation is also partially shared by ReLU networks (Freeman & Bruna, 2016).) In the following, we only consider critical points modulo this equivalence relation, up to which the total number of critical points is given by the sum of the numbers of ways to choose from 0 to $p$ elements from an $n$-element set: $\sum_{k=0}^{p} \binom{n}{k}$. The loss surface of the DLAE we consider ($n = 16$, $p = 4$) therefore contains a single minimum and 2516 saddles. The minimum corresponds to a solution where the network has learned to project onto the four eigenvectors with the largest eigenvalues, and the saddles correspond to all other projections. Note that the single critical point that corresponds to learning none of the eigenvectors, resulting in a parameter vector of all zeros, shall be referred to as the “critical point at zero” later on.
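The count above can be verified directly for the network studied here (sixteen input/output units, four hidden units):

```python
from math import comb

n, p = 16, 4  # input/output units and hidden units of the DLAE
# one critical point (up to equivalence) per choice of 0 to p eigenvectors
total = sum(comb(n, k) for k in range(p + 1))
print(total, total - 1)   # 2517 critical points: 1 minimum and 2516 saddles
```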
It is possible to construct the weight matrices for the DLAE so as to initialize it to a particular critical point: the input-to-hidden weight matrix is constructed by placing the eigenvectors represented at that critical point in its columns, while the hidden-to-output weight matrix is its transpose. This allows us to compute every element of $\mathcal{C}$, up to equivalence, and then compute the true values of, e.g., the loss and index. These values can then be compared to those computed from a subset of $\mathcal{C}_\epsilon$, i.e., critical points obtained via numerical methods.
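This construction can be checked numerically. The sketch below (function names ours; a population-loss finite-difference check stands in for an analytic gradient) builds the global minimum of the 16-by-4 DLAE on the diagonal covariance described above and confirms that spot-checked partial derivatives vanish:

```python
import numpy as np

def dlae_loss(W1, W2, Sigma):
    """Population squared-error loss E||x - W2 @ W1 @ x||^2 of a linear
    autoencoder under zero-mean data with covariance Sigma."""
    R = np.eye(Sigma.shape[0]) - W2 @ W1
    return np.trace(R @ Sigma @ R.T)

n, p = 16, 4
Sigma = np.diag(np.arange(1.0, n + 1))   # eigenvalues 1..16; eigenvectors = standard basis
top4 = [15, 14, 13, 12]                  # indices of the four largest eigenvalues
W2 = np.eye(n)[:, top4]                  # columns = selected eigenvectors
W1 = W2.T                                # transpose, so W2 @ W1 projects onto them

# central finite differences: every spot-checked partial derivative vanishes
eps, max_abs_grad = 1e-5, 0.0
for i, j in [(0, 0), (1, 5), (3, 15)]:
    E = np.zeros_like(W1); E[i, j] = eps
    d = (dlae_loss(W1 + E, W2, Sigma) - dlae_loss(W1 - E, W2, Sigma)) / (2 * eps)
    max_abs_grad = max(max_abs_grad, abs(d))
print(max_abs_grad < 1e-6)   # True: the constructed point is a critical point
```

The loss at this point equals the sum of the twelve discarded eigenvalues (78 here), matching the picture of the minimum as the PCA projection onto the top four eigenvectors.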
To find critical points, we first computed optimization trajectories by training a linear network on Gaussian data with the squared error loss. These optimization trajectories were then used to generate initial points for critical point-finding algorithms. Unless otherwise stated, initial points were selected uniformly with respect to height on the loss surface. Critical point-finding algorithms were terminated either when no proposed step size met acceptance criteria or when a maximum number of epochs was reached. Returned points were accepted as numerical critical points if their squared gradient norm was less than 1e-10.
To optimize the gradient norm objective (Eq. 1), we use batch gradient descent with back-tracking line search using the Wolfe conditions (Wolfe, 1971). We use fast Hessian-vector products (Pearlmutter, 1994) to compute our updates with the AutoGrad (Maclaurin, 2016) Python package.
For both Newton methods, we use the robust linear solver MR-QLP (Choi et al., 2011). See (Roosta et al., 2018) for a succinct explanation of the benefits of this solver for poorly-conditioned, indefinite problems. This solver is also accelerated by the use of fast Hessian-vector products.
See Supplementary Materials for further details and hyperparameter values.
4.1 Newton Methods Outperform Gradient Norm Minimization
$\epsilon$-CPs found by all three numerical methods can have the same qualitative and quantitative loss and index values as do true CPs, but the methods have varying efficiencies (Fig. 1). With the selected convergence criteria (see caption), none of the methods find $\epsilon$-CPs in regions of the loss-index plane where there are no true CPs (e.g. bad local minima, in the top left quadrant, or low-lying saddles, in the bottom right quadrant). All methods also find subsets of $\mathcal{C}_\epsilon$ that span the same values of loss and index as does $\mathcal{C}$.
Newton-MR (Fig. 1, left column) terminates in fewer iterations than does Newton-TR (Fig. 1, middle column; median 221 vs. 430 iterations; Mann-Whitney U = 4768.5). Newton-TR furthermore requires multiple calls to MR-QLP per iteration (one for each choice of trust region size), and so Newton-MR terminates more quickly (26.5 s per 100 iterations for Newton-MR vs. 1 min 39 s for Newton-TR, on commodity hardware). A more sophisticated mechanism for determining trust region size, rather than simply selecting the best choice from a pre-defined set, might narrow this performance gap.
Gradient norm minimization (GNM; Fig. 1, right column) is less efficient in two ways. First, successful runs of GNM take on the order of one hundred times as many epochs to reach the same value of the gradient norm as do the Newton methods. Each GNM iteration only requires a single Hessian-vector multiply, unlike the Newton methods, whose calls to MR-QLP require several Hessian-vector multiplies, but the difference in wall time is still more than an order of magnitude (6.8 s per 100 iterations for GNM). Furthermore, as discussed above in Section 2, GNM tends to get stuck in local minima of its objective function, as evidenced by the numerous short orange traces which terminate without the gradient norm going below 1e-10 (62.7% of runs). Because of this, even though the same number of runs and more compute were given to GNM, the number of $\epsilon$-CPs recovered is far smaller (cf. Fig. 1, bottom-right and bottom-left panels). Note that local minima of the gradient norm objective need not correspond to local minima, or to critical points at all, of the original problem. Thus we conclude that Newton methods in general, and Newton-MR in particular, are a better choice of critical point-finding algorithm for neural network loss surfaces.
4.2 Strict Cutoffs are Necessary to Accurately Recover Critical Points
Here and in previous work, the final output of a critical point-finding algorithm was accepted if the squared gradient norm on termination was below some cutoff value (here, 1e-10). However, at termination, many points may be far below this cutoff. In fact, the vast majority of runs terminate with squared gradient norm close to 1e-30, and so the results in Fig. 1 do not demonstrate that simply having squared gradient norm at the cutoff 1e-10 is sufficient to guarantee a match between the loss and index values of elements of and of . If a sufficient cutoff could be identified, then much iteration time could be saved by terminating runs as soon as the squared gradient norm went below that value.
We found that, while having squared gradient norm equal to our cutoff was sufficient to guarantee accurate recovery of loss and index values, for the cutoff values used in (Dauphin et al., 2014) and (Pennington & Bahri, 2017), it was not (Fig. 2). The error is larger for lower values of the loss. Note, however, that the overall shape of the loss-index relationship is largely preserved for these looser cutoffs. Interestingly, we find that simply computing the loss and index of points along the optimization trajectory results in the same concave-up shape reported in previous work (Dauphin et al., 2014; Pennington & Bahri, 2017) (Fig. 2, bottom-right panel), underscoring the need for care in the selection of convergence criteria.
4.3 Adding Noise Reduces Sampling Bias
As discussed above, neither sampling uniformly from the trajectory nor sampling uniformly in height on the loss surface guarantees an unbiased sample of the set $\mathcal{C}$. Indeed, the sample of critical points found by seeding a Newton method from a single optimization trajectory is heavily biased (Fig. 3, top left panel). First, samples from all trajectories evinced a bias towards critical points that perform projection onto eigenvectors with large eigenvalues. Second, individual trajectories are similarly biased towards a few other, seemingly random eigenvectors (e.g. 10 and 12 for the trajectory in green; 9 for the trajectory in orange). We quantified this bias by taking the distribution of targeted eigenvectors (plotted as histograms in Fig. 3, top left panel) and computing its entropy (Fig. 3, top right panel).
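This bias measure can be sketched as follows (function name and sample counts illustrative, not taken from the experiments): each recovered critical point is reduced to the eigenvectors it projects onto, and the entropy of the resulting histogram quantifies how evenly the eigenvectors are covered:

```python
import numpy as np

def selection_entropy(eig_indices, n=16):
    """Entropy (in bits) of the empirical distribution of eigenvectors
    targeted by recovered critical points; lower entropy = more bias."""
    counts = np.bincount(eig_indices, minlength=n).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]                      # convention: 0 * log 0 = 0
    return float(-(probs * np.log2(probs)).sum())

# a sample biased toward the top eigenvectors has low entropy...
biased = [15] * 8 + [14] * 4 + [13] * 2 + [12, 11]
print(selection_entropy(biased))                  # 1.875 bits
# ...while an even sample over all 16 eigenvectors attains the maximum
print(selection_entropy(list(range(16))))         # 4.0 bits
```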
To reduce this sampling bias, (Dauphin et al., 2014) proposed adding noise to the values sampled from the trajectory. We investigated whether this approach worked with Gaussian noise. All results presented in this section were obtained using Newton-MR; results for Newton-TR were qualitatively similar. We found that bias was partially reduced for appropriate choices of noise variance (Fig. 3, top right panel and bottom row). For noise with low variance (Fig. 3, bottom-left-most panel), there was no substantial change in either the entropy or the identity of the recovered critical points. For noise with high variance (Fig. 3, bottom-right-most panel), adding noise did not increase the entropy and amplified the bias towards critical points projecting onto eigenvectors with large eigenvalues. This result is somewhat curious, since the loss values of the noise-corrupted points are higher, not lower, than the originals; it indicates that Newton methods do not always converge to points nearby in loss value. For intermediate values of the noise variance (Fig. 3, bottom-center panel), adding noise reduced the bias, as quantified by a significant increase in the entropy (bootstrap means 3.05 bits with no noise vs. 3.34 bits with noise; Student's $t$-test), and resulted in a closer match between the distributions of the indices of the true and computed CPs (see histograms in Fig. 3, bottom row). This intermediate variance corresponded to a signal-to-noise ratio of 2.8 dB, compared to SNRs of 6.8 dB and 0.8 dB in the low- and high-variance cases. This suggests that, to optimally reduce bias, added noise must be of sufficient magnitude to perturb the parameters, but not so large as to approach the magnitude of the parameters themselves.
4.4 Sampling Bias is Worse when Initializing Uniformly Across Iterations
Previous work sampled initial points for critical point-finding algorithms from optimization trajectories either uniformly across iterations (Dauphin et al., 2014) or uniformly in height on the loss surface (Pennington & Bahri, 2017). Note that all results reported above used the latter method. We compared these two methods and found that sampling uniformly across iterations resulted in a heavy bias towards critical points projecting onto dominant eigenvectors (Fig. 4, top left and bottom left panels). We quantified this with the entropy of the distribution of eigenvectors onto which critical points projected, as above, and found that the entropy was significantly lower when sampling uniformly across iterations (Fig. 4, top right panel). Adding Gaussian noise to the sampled values had no effect for small noise variances and did not fully counteract the reduction in entropy for intermediate variances (Fig. 4, bottom row).
We examined the performance of three methods to numerically find the critical points of neural network loss surfaces on a test problem for which the ground truth critical points are known. In the absence of this ground truth and in the presence of numerical and convergence concerns, the fidelity of reported critical points to true critical points is unknown. We found that, while all three methods are capable of finding the critical points of a deep linear autoencoder, the Newton method-based algorithms outperform direct gradient norm minimization (GNM). As predicted by theory, GNM frequently (on 62.7% of runs) gets stuck in local minima and requires two orders of magnitude more iterations to converge, likely due to extremely poor conditioning.
We identified a recently-proposed Newton method, Newton-MR, as a promising candidate algorithm and found that it produced solutions with fewer iterations and in less wall time than the trust region Newton method used in (Dauphin et al., 2014). The possible applicability of quasi-Newton methods (Liu & Nocedal, 1989) for large problems remains unexplored and is left to future investigation. Newton-based homotopy methods (Mehta et al., 2015) are another promising future direction, based on preliminary results in (Coetzee & Stonick, 1997).
Numerically-recovered critical points rarely have gradient norm exactly zero, and the appropriate cutoff value for faithfully representing the loss and index values of the true critical points of a neural network is unknown. Our results suggest that for precise recovery of loss and index values, this cutoff in the squared norm should be stricter (1e-10) than previously reported (1e-06 in (Pennington & Bahri, 2017), unreported in (Dauphin et al., 2014)), presuming that the relevant Lipschitz constants for nonlinear networks are at least as large as those for linear networks.
Given the infeasibility of calculating all of the critical points of the loss surface, the goal of critical point-finding methods should be to produce an unbiased picture of the true critical points. We find that previously reported sampling methods based on optimization trajectories are biased towards critical points at low values of the loss and towards a seemingly random subset of other critical points, possibly determined by which critical points were closest to the optimization trajectory. Adding noise was insufficient to remove this bias, but did mitigate it slightly. Interestingly, we found that adding larger amounts of noise actually increased the bias towards critical points at low values of the loss. Further work is needed to explain this phenomenon.
Taken together, our findings suggest that numerically recovering the critical points of nonlinear neural networks with high accuracy is possible, even if the problem of obtaining an even sample is still unsolved. For questions regarding just the parts of the loss surface that a typical training trajectory passes through, relevant for answering questions about optimization, the latter is not such a serious problem. However, it remains problematic for questions about the mathematical properties of the ensemble of critical points. Answering these questions for a wide variety of neural networks can provide insight into which properties of the loss surface are most relevant for neural network training and generalization and by which mechanisms they arise.
The authors would like to thank Jesse Livezey, Shariq Mobin, Jascha Sohl-Dickstein, Max Simchowitz, Levent Sagun, Yasaman Bahri, and Jeffrey Pennington for useful discussions. Author CF was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. Authors CF & KB were funded by a DOE/LBNL LDRD, ‘Deep Learning for Science’, (PI, Prabhat). NW was supported by the Google PhD Fellowship. MRD was supported in part by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract W911NF-13-1-0390.
- Advani & Saxe (2017) Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
- Angelani et al. (2000) Angelani, L., Leonardo, R. D., Ruocco, G., Scala, A., and Sciortino, F. Saddles in the energy landscape probed by supercooled liquids. Physical Review Letters, 85(25):5356–5359, 2000.
- Auer et al. (1996) Auer, P., Herbster, M., and Warmuth, M. K. Exponentially many local minima for single neurons. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems 8, pp. 316–322. MIT Press, 1996.
- Baldi & Hornik (1989) Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
- Ballard et al. (2017) Ballard, A. J., Das, R., Martiniani, S., Mehta, D., Sagun, L., Stevenson, J. D., and Wales, D. J. Energy landscapes for machine learning. Phys. Chem. Chem. Phys., 19:12585–12603, 2017.
- Bates et al. (2013) Bates, D. J., Hauenstein, J. D., Sommese, A. J., and Wampler, C. W. Numerically Solving Polynomial Systems with Bertini (Software, Environments and Tools). Society for Industrial and Applied Mathematics, 2013. ISBN 1611972698.
- Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.
- Broderix et al. (2000) Broderix, K., Bhattacharya, K. K., Cavagna, A., Zippelius, A., and Giardina, I. Energy landscape of a Lennard-Jones liquid: statistics of stationary points. Physical Review Letters, 85(25):5360–5363, 2000.
- Broyden (1969) Broyden, C. G. A new method of solving nonlinear simultaneous equations. The Computer Journal, 12(1):94–99, 1969.
- Choi et al. (2011) Choi, S.-C. T., Paige, C. C., and Saunders, M. A. MINRES-QLP: A Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing, 33(4):1810–1836, 2011.
- Choromanska et al. (2014) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surface of multilayer networks. CoRR, abs/1412.0233, 2014.
- Coetzee & Stonick (1997) Coetzee, F. and Stonick, V. L. 488 solutions to the XOR problem. In Mozer, M. C., Jordan, M. I., and Petsche, T. (eds.), Advances in Neural Information Processing Systems 9, pp. 410–416. MIT Press, 1997.
- Dauphin et al. (2014) Dauphin, Y., Pascanu, R., Gülçehre, Ç., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014.
- Davidenko (1953) Davidenko, D. F. On a new method of numerical solution of systems of nonlinear equations. Dokl. Akad. Nauk SSSR, 88:601–602, 1953.
- Dinh et al. (2017) Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. CoRR, abs/1703.04933, 2017.
- Doye & Wales (2002) Doye, J. P. K. and Wales, D. J. Saddle points and dynamics of Lennard-Jones clusters, solids, and supercooled liquids. The Journal of Chemical Physics, 116(9):3777–3788, 2002.
- Freeman & Bruna (2016) Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Gouk et al. (2018) Gouk, H., Frank, E., Pfahringer, B., and Cree, M. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.
- Griewank & Osborne (1983) Griewank, A. and Osborne, M. R. Analysis of Newton’s method at irregular singularities. SIAM Journal on Numerical Analysis, 20(4):747–773, 1983. ISSN 00361429.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1–42, 1997.
- Izmailov & Solodov (2014) Izmailov, A. F. and Solodov, M. V. Newton-Type Methods for Optimization and Variational Problems. Springer International Publishing, 2014. doi: 10.1007/978-3-319-04247-3.
- Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. CoRR, abs/1703.00887, 2017.
- Jin et al. (2018) Jin, C., Liu, L. T., Ge, R., and Jordan, M. I. Minimizing nonconvex population risk from rough empirical risk. CoRR, abs/1803.09357, 2018.
- Kehoe (1987) Kehoe, T. J. Computation and multiplicity of economic equilibria. The Economy as an Evolving Complex System, pp. 147–167, 1987.
- Laurent & von Brecht (2017) Laurent, T. and von Brecht, J. Deep linear neural networks with arbitrary loss: all local minima are global. CoRR, abs/1712.01473, 2017.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521:436 – 444, 2015.
- Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1246–1257, Columbia University, New York, New York, USA, 2016. PMLR.
- Levenberg (1944) Levenberg, K. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.
- Liang et al. (2018) Liang, S., Sun, R., Li, Y., and Srikant, R. Understanding the loss surface of neural networks for binary classification. CoRR, abs/1803.00909, 2018.
- Liu & Nocedal (1989) Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
- Maclaurin (2016) Maclaurin, D. Modeling, Inference and Optimization with Composable Differentiable Procedures. PhD thesis, Harvard University, April 2016.
- McIver & Komornicki (1972) McIver, J. W. and Komornicki, A. Structure of transition states in organic reactions. general theory and an application to the cyclobutene-butadiene isomerization using a semiempirical molecular orbital method. Journal of the American Chemical Society, 94(8):2625–2633, 1972.
- Mehta et al. (2015) Mehta, D., Chen, T., Morgan, J. W. R., and Wales, D. J. Exploring the potential energy landscape of the Thomson problem via Newton homotopies. The Journal of Chemical Physics, 142(19):194113, 2015.
- Mehta et al. (2018a) Mehta, D., Chen, T., Tang, T., and Hauenstein, J. D. The loss surface of deep linear networks viewed through the algebraic geometry lens, 2018a.
- Mehta et al. (2018b) Mehta, D., Zhao, X., Bernal, E. A., and Wales, D. J. Loss surface of XOR artificial neural networks. Physical Review E, 97(5), 2018b.
- Pascanu et al. (2014) Pascanu, R., Dauphin, Y. N., Ganguli, S., and Bengio, Y. On the saddle point problem for non-convex optimization. CoRR, abs/1405.4604, 2014.
- Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6:147–160, 1994.
- Pennington & Bahri (2017) Pennington, J. and Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Learning Representations (ICLR), 2017.
- Roosta et al. (2018) Roosta, F., Liu, Y., Xu, P., and Mahoney, M. W. Newton-MR: Newton’s method without smoothness or convexity. arXiv preprint arXiv:1810.00303, 2018.
- Sagun et al. (2017) Sagun, L., Evci, U., Güney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the Hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017.
- Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
- Schmidhuber (2014) Schmidhuber, J. Deep learning in neural networks: An overview. CoRR, abs/1404.7828, 2014.
- Sommese et al. (2005) Sommese, A. J., Verschelde, J., and Wampler, C. W. Introduction to numerical algebraic geometry. In Solving polynomial equations, pp. 301–337. Springer, 2005.
- Trygubenko & Wales (2004) Trygubenko, S. A. and Wales, D. J. A doubly nudged elastic band method for finding transition states. The Journal of Chemical Physics, 120(5):2082–2094, 2004.
- Wales (2004) Wales, D. Energy Landscapes: Applications to Clusters, Biomolecules and Glasses. Cambridge University Press, 2004.
- Wolfe (1971) Wolfe, P. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 13(2):185–188, 1971.
- Yao et al. (2018) Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. CoRR, abs/1802.08241, 2018.
- Zhou & Liang (2017) Zhou, Y. and Liang, Y. Critical points of neural networks: analytical forms and landscape properties. arXiv preprint arXiv:1710.11205v1, 2017.
6 Supplementary Material
6.1 Detailed Methods
The input data to the network was a 10,000-element sample from a 16-dimensional Gaussian with mean zero and a diagonal covariance matrix with entries . Because the analytical results (Baldi & Hornik, 1989) were derived for centered data, the data was then zero-centered to floating-point precision by subtracting the sample mean.
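This data-generation step can be sketched as follows; the diagonal covariance entries are elided above, so the values used here are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 10_000, 16

# Hypothetical diagonal covariance entries; the paper's actual values are not
# reproduced here.
cov_diag = np.linspace(1.0, 0.1, dim)

# Sample from a zero-mean Gaussian with diagonal covariance.
X = rng.normal(size=(n_samples, dim)) * np.sqrt(cov_diag)

# Zero-center to floating-point precision by subtracting the sample mean,
# matching the centered-data assumption of Baldi & Hornik (1989).
X = X - X.mean(axis=0)
```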
6.3 Network and Training
The linear autoencoder network had 16 input units, 4 hidden units, and 16 output units. Initial parameter values were sampled from a Gaussian with variance equal to the inverse of the number of weights in each weight matrix. The network was then trained for 10,000 epochs of full-batch gradient descent with a fixed learning rate of 0.01. Final parameter values had losses within 5e-6 of the loss of the global minimizer, starting from losses that exceeded it by order 1.
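A minimal sketch of this setup follows; the sample size, epoch count, and random seed are reduced or assumed for illustration, so the resulting losses are not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered data standing in for the paper's 16-d Gaussian sample.
n, d, h = 1000, 16, 4
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)

# Initial weights: Gaussian with variance 1 / (number of weights in the matrix).
W1 = rng.normal(scale=np.sqrt(1.0 / (d * h)), size=(d, h))  # encoder
W2 = rng.normal(scale=np.sqrt(1.0 / (h * d)), size=(h, d))  # decoder

def loss(W1, W2):
    R = X @ W1 @ W2 - X
    return 0.5 * np.sum(R ** 2) / n

loss_init = loss(W1, W2)
lr = 0.01
for _ in range(2000):  # the paper trains for 10,000 epochs
    R = X @ W1 @ W2 - X           # reconstruction residual
    gW2 = (X @ W1).T @ R / n      # dL/dW2
    gW1 = X.T @ (R @ W2.T) / n    # dL/dW1
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_final = loss(W1, W2)
```

With enough epochs the loss approaches that of the rank-4 PCA projection, the global minimizer in this linear setting (Baldi & Hornik, 1989).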
6.4 Back-Tracking Line Search
The step sizes for minimizing the gradient norm objective (Equation 1) and along the Newton-MR search direction were computed using back-tracking line search. The initial step size was 0.1, and the step size was multiplied by 0.5 whenever a proposed update failed to meet the update criteria (the Wolfe conditions for GNM (Wolfe, 1971); the criteria from Roosta et al. (2018) for Newton-MR). In both cases, the free parameter of the Armijo-type condition was set to 1e-04. For GNM, the free parameter of the curvature-type condition was set to 0.9. After a step was accepted, the step size was multiplied by 2 and used as the initial step size for the next step. Line search was terminated when multiplying the proposed step size by 0.5 produced no change in its value in the floating-point type in use. All of our experiments used double-precision floats.
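A minimal sketch of this back-tracking scheme, covering only the Armijo-type condition (the curvature-type condition and the Newton-MR acceptance criteria are omitted), with the shrink, growth, and Armijo parameters stated above:

```python
import numpy as np

def backtrack(f, grad_f, x, p, t0=0.1, shrink=0.5, grow=2.0, c1=1e-4):
    """Armijo back-tracking line search along direction p.

    Returns (accepted step, initial step for the next iteration).
    """
    fx, gx = f(x), grad_f(x)
    t = t0
    # Shrink until the Armijo sufficient-decrease condition holds.
    while f(x + t * p) > fx + c1 * t * (gx @ p):
        new_t = t * shrink
        if new_t == t:  # halving changed nothing in this float type: terminate
            break
        t = new_t
    return t, t * grow  # accepted step is doubled for the next line search

# Usage on a simple quadratic f(x) = 0.5 ||x||^2 with a steepest-descent direction.
f = lambda x: 0.5 * (x @ x)
g = lambda x: x
x = np.array([1.0, -2.0])
t, t0_next = backtrack(f, g, x, -g(x))
```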
For the Newton-TR update (Equation 3), we used five evenly log-spaced values of from 1 to 1e-04. We found that narrower ranges degraded convergence. We used a step size of 0.1.
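The swept values can be generated directly; the symbol of the swept parameter is elided above, so the variable name below is a placeholder:

```python
import numpy as np

# Five evenly log-spaced values from 1 down to 1e-04.
tr_param_values = np.logspace(0.0, -4.0, num=5)  # placeholder name
```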
6.6 Calculating Index and Identifying Critical Points
To calculate the index numerically, a tolerance must be set: only eigenvalues below the negative of this tolerance are counted as negative. We chose 1e-05, the square root of our convergence criterion for the squared gradient norm.
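A sketch of this computation, assuming access to the dense Hessian (for large networks one would instead use Hessian-vector products (Pearlmutter, 1994)):

```python
import numpy as np

def numerical_index(hessian, tol=1e-5):
    """Count eigenvalues below -tol; values in (-tol, 0) are treated as zero."""
    eigvals = np.linalg.eigvalsh(hessian)  # Hessian is symmetric
    return int(np.sum(eigvals < -tol))

# Usage: the saddle of f(x, y) = x**2 - y**2 has Hessian diag(2, -2), index 1,
# while a minimum (positive-definite Hessian) has index 0.
assert numerical_index(np.diag([2.0, -2.0])) == 1
assert numerical_index(np.eye(3)) == 0
```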
Critical points were identified as follows. First, the un-factorized linear map applied by the network was computed by multiplying together the input and output weight matrices. Near a critical point, this map should be strongly diagonally dominant, thanks to the diagonal structure of the data's true covariance matrix. We found that simply identifying which diagonal elements exceeded 0.9 was sufficient to determine a critical point's identity: doing so never produced a putative critical point that projected onto more than 4 eigenvectors.
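A sketch of this identification procedure, assuming weight matrices `W1` (input) and `W2` (output) whose product is the network's end-to-end linear map; the function name and threshold default are illustrative:

```python
import numpy as np

def identify_critical_point(W1, W2, threshold=0.9):
    """Label a critical point by which coordinate axes its end-to-end map keeps."""
    M = W1 @ W2  # un-factorized linear map applied by the network
    # Near a critical point M is close to a projection onto a subset of the
    # coordinate axes, so the diagonal alone identifies the critical point.
    kept = np.flatnonzero(np.diag(M) > threshold)
    return frozenset(kept.tolist())

# Usage: a rank-4 autoencoder that exactly projects onto the first 4 axes.
d, h = 16, 4
W1 = np.zeros((d, h))
W1[:h, :] = np.eye(h)
W2 = W1.T
assert identify_critical_point(W1, W2) == frozenset({0, 1, 2, 3})
```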