
Numerically Recovering the Critical Points of a Deep Linear Autoencoder

01/29/2019
by Charles G. Frye, et al.

Numerically locating the critical points of non-convex surfaces is a long-standing problem central to many fields. Recently, the loss surfaces of deep neural networks have been explored to gain insight into outstanding questions in optimization, generalization, and network architecture design. However, the degree to which recently-proposed methods for numerically recovering critical points actually do so has not been thoroughly evaluated. In this paper, we examine this issue in a case for which the ground truth is known: the deep linear autoencoder. We investigate two sub-problems associated with numerical critical point identification: first, because of large parameter counts, it is infeasible to find all of the critical points for contemporary neural networks, necessitating sampling approaches whose characteristics are poorly understood; second, the numerical tolerance for accurately identifying a critical point is unknown, and conservative tolerances are difficult to satisfy. We first identify connections between recently-proposed methods and well-understood methods in other fields, including chemical physics, economics, and algebraic geometry. We find that several methods work well at recovering certain information about loss surfaces, but fail to take an unbiased sample of critical points. Furthermore, numerical tolerance must be very strict to ensure that numerically-identified critical points have similar properties to true analytical critical points. We also identify a recently-published Newton method for optimization that outperforms previous methods as a critical point-finding algorithm. We expect our results will guide future attempts to numerically study critical points in large nonlinear neural networks.


1 Introduction

Neural networks have pushed forward the state of the art in a variety of machine learning tasks (Schmidhuber, 2014; LeCun et al., 2015), but little is known about precisely how they work or why they can be effectively optimized to solutions with good test set performance. In particular, it is not known why the non-convex training problem of a neural network is soluble using popular local methods such as stochastic gradient descent. One approach to answering this question involves direct numerical interrogation of the loss surfaces of neural networks. Efforts in this vein (Pascanu et al., 2014; Dauphin et al., 2014; Choromanska et al., 2014; Pennington & Bahri, 2017) have demonstrated cases in which these loss surfaces enjoy a favorable "no bad local minima" property (that is, all minima are located at low values of the loss) and instead contain a proliferation of saddles, which was believed to be problematic for first-order methods. Later theoretical work in non-convex optimization then showed that some stochastic first-order methods avoid saddle points of the training loss (Lee et al., 2016; Jin et al., 2017) and converge to minima that generalize to the test loss (Jin et al., 2018). Taken together, these claims would do much to explain the trainability of neural networks.

However, a thorough understanding of the trainability of neural networks remains to be achieved. In particular, much is unknown regarding the properties of the loss surface, on which we focus in this paper. The theoretical results cited above concerning the shape of the loss surface are asymptotic and apply to simplified, approximate models (Choromanska et al., 2014; Pennington & Bahri, 2017). The rates at which these asymptotic results apply and the degree to which these approximations break down as properties of the data and architecture are varied are unknown. Furthermore, recent theoretical results (Liang et al., 2018) have indicated that the loss surfaces for some common neural network architectures, e.g. those with ReLUs and without skip connections, suffer from poor local minima even for linearly separable data. Lastly, there is debate in the field on whether minima with different generalization performance exist and are characterized by their curvature properties (Hochreiter & Schmidhuber, 1997; Dinh et al., 2017; Yao et al., 2018). Taken together, all these results underscore the need for further study of the shape of the loss surface.

One way to determine to what extent and why neural network loss surfaces can suffer from bad local minima, and what effect the distribution of minima, maxima, and saddle points (generically, points with zero gradient norm, or critical points) has on optimization and generalization, is to perform a thorough empirical examination of the interplay between architecture, data, and initialization strategy as they affect the loss surface. In order to do so, robust numerical methods for finding critical points are needed.

Unfortunately, the problem of finding all of the critical points of a neural network loss surface is non-trivial. First, the rapid scaling of the number of critical points with input dimension (Baldi & Hornik, 1989; Auer et al., 1996), ignoring continuous equivalences (Freeman & Bruna, 2016), is such that finding all of them can be impossible for any but the smallest of architectures. This necessitates a sampling-based method, but the potential biases of such sampling methods are unknown and hard to quantify in the absence of ground truth. Second, numerically finding points with gradient exactly equal to zero is unlikely with finite precision, necessitating the setting of a tolerance for the norm of the gradient. This tolerance can be analytically determined for functions with Lipschitz conditions on the operator norm of the Hessian and on the norm of the gradient, but these values are poorly controlled for common neural networks (Gouk et al., 2018) and worst-case upper bounds might be overly pessimistic in the typical case. In the absence of ground truth, it is impossible to evaluate the efficacy of critical point-finding algorithms in solving these two problems. Here we make progress in this direction by making a thorough study of three critical point-finding algorithms on a model where ground truth is available, thereby clarifying algorithmic choices that may yield accurate results in more complicated cases.

The problem of finding the critical points of neural network loss surfaces is actually a specific instance of an old, general problem in disguise: the problem of finding the zeros, or roots, of a function. Here, that function is the gradient field of the loss surface. In multiple other areas, e.g. chemical physics (Wales, 2004; Ballard et al., 2017), numerical algebraic geometry (Sommese et al., 2005), and economics (Kehoe, 1987), work extending back decades has identified algorithms for this task with varying convergence properties, domains of application, caveats, and scalabilities.

In Sec. 2.2, we review methods used in previous papers that found the critical points of neural network loss surfaces in the context of this literature. We also introduce a new algorithm, originally invented for another purpose, as a critical point-finding algorithm for neural networks. We then apply these methods to a neural network loss surface where the ground truth identities of the critical points are known: that of a deep linear autoencoder (Baldi & Hornik, 1989), which minimizes the same loss over the same class of functions as does Principal Components Analysis (PCA), but in a different parameterization. Though linear networks do not have the same representational capacity as nonlinear networks, their training exhibits many of the properties of nonlinear network training (Saxe et al., 2013), and they provide a reasonably close test problem for algorithms of unknown efficacy.

We find that, while all the algorithms we study are capable of finding critical points (Fig. 1), strict cutoffs are necessary to ensure accuracy (Fig. 2), and all methods provide biased samples of the set of critical points (Figs. 3 & 4). We identify algorithmic choices that can improve performance and reduce this bias.

2 Methods for Sampling Critical Points

The problem of sampling the critical points of a loss surface requires two pieces to solve: first, an algorithm capable of finding a single critical point; second, a method for initializing this algorithm repeatedly in such a way that its outputs are a representative, ideally unbiased, sample of the true critical points.

2.1 Preliminaries

The loss surface is a scalar function $L(\theta)$ of the parameters $\theta \in \mathbb{R}^d$ of the neural network. We wish to understand, by numerical means, the distribution of points where the gradient of the loss is zero. Formally, we define these, the critical points (CPs) of $L$, as the set

$\mathcal{C} = \{\theta : \nabla L(\theta) = 0\}.$

A CP $\theta^*$ can be classified by means of its (fractional) index $\alpha$, defined as the fractional number of negative eigenvalues of the Hessian at that point: $\alpha(\theta^*) = \frac{1}{d}\,\#\{i : \lambda_i(\nabla^2 L(\theta^*)) < 0\}$. Note that $\alpha = 1$, when the Hessian is negative semidefinite, indicates a potential local maximum; $0 < \alpha < 1$ always indicates a saddle point; and $\alpha = 0$, when the Hessian is positive semidefinite, indicates a potential local minimum.

In practice, it is not possible to numerically locate points where $\nabla L$ is identically zero. Hence we introduce a relaxed definition of a CP, an $\epsilon$-CP, defined as an element of the set

$\mathcal{C}_\epsilon = \{\theta : \|\nabla L(\theta)\|^2 \le \epsilon\}.$

We will discuss in detail the effects of different choices of $\epsilon$ on the results of running critical point-finding algorithms, shown in Fig. 2.
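As a concrete illustration (not code from the paper), these two definitions translate into a few lines of NumPy; `grad_fn` and `hess_fn` are hypothetical callables returning the gradient and Hessian of $L$, and the eigenvalue tolerance mirrors the one used in the Supplementary Material:

```python
import numpy as np

def is_eps_cp(grad_fn, theta, eps=1e-10):
    """Relaxed critical-point test: squared gradient norm at most eps."""
    g = grad_fn(theta)
    return float(np.dot(g, g)) <= eps

def fractional_index(hess_fn, theta, tol=1e-5):
    """Fraction of Hessian eigenvalues more negative than -tol at theta."""
    eigvals = np.linalg.eigvalsh(hess_fn(theta))  # Hessian is symmetric
    return float(np.mean(eigvals < -tol))
```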

In the context of neural network loss surfaces, an often-studied property of $\mathcal{C}$ is the relationship of the indices of its members to their heights on the surface. (Dauphin et al., 2014) and (Pennington & Bahri, 2017) proposed models of $\mathcal{C}$ and numerically found subsets of $\mathcal{C}_\epsilon$ with the same loss-index relationship. In this paper, we identify algorithmic choices that lead to recovery of a subset of $\mathcal{C}_\epsilon$ with approximately the same loss-index relationship as $\mathcal{C}$, in a case where $\mathcal{C}$ is known.

2.2 Finding a Single Critical Point

2.2.1 Gradient Norm Minimization

Given the problem of finding points that approximately satisfy a certain criterion, the natural optimization approach is to convert that criterion into a differentiable loss function and minimize it by local methods. For the problem of finding points with small gradient norm, the result is an auxiliary loss function

$g(\theta) = \tfrac{1}{2}\,\|\nabla L(\theta)\|^2.$   (1)

An instance of this class of methods was independently proposed for critical point-finding in neural networks by (Pennington & Bahri, 2017), but such methods in fact have a long history in chemical physics, dating back to the 1970s under the name "gradient norm minimization" (GNM) (McIver & Komornicki, 1972), and they were simultaneously and independently rediscovered thirty years later (Angelani et al., 2000; Broderix et al., 2000).

There are two major concerns with this class of methods. First, the problem is approximately quadratically worse-conditioned than the original problem (McIver & Komornicki, 1972), and neural network losses are already poorly conditioned (Sagun et al., 2017), resulting in very slow convergence for first-order methods. Second, the loss surface of GNM can, ironically, suffer from bad local minima of its own: it contains local minima that are not true critical points of the original loss surface. These arise wherever the gradient lies in the nullspace of the Hessian. It has been shown in the chemical physics literature that on some surfaces these spurious local minima dominate the global minima (Doye & Wales, 2002).
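For concreteness, a minimal sketch of GNM by plain gradient descent follows. It relies on the identity $\nabla g(\theta) = \nabla^2 L(\theta)\,\nabla L(\theta)$, computed with one Hessian-vector product per step via autograd; the fixed step size stands in for the back-tracking line search used in our experiments, and the function names are illustrative only:

```python
import autograd.numpy as np
from autograd import grad

def gradient_norm_minimization(loss_fn, theta0, step=0.1, max_iter=100000, eps=1e-10):
    """Minimize g(theta) = 0.5 * ||grad L(theta)||^2 by plain gradient descent.

    Uses grad g(theta) = H(theta) @ grad L(theta), computed with a single
    Hessian-vector product per iteration (the Hessian is never formed).
    """
    grad_L = grad(loss_fn)
    theta = theta0
    for _ in range(max_iter):
        g = grad_L(theta)
        if np.dot(g, g) <= eps:  # reached an epsilon-CP
            break
        # Hessian-vector product H(theta) @ g via double backward:
        hvp = grad(lambda th: np.dot(grad_L(th), g))(theta)
        theta = theta - step * hvp
    return theta
```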

2.2.2 Newton Methods

Other approaches to finding points with zero gradient norm are better understood and have better convergence properties. In particular, the zeros of the gradient field are solutions to the nonlinear system of equations $\nabla L(\theta) = 0$, and can be found using root-finding techniques. The classic root-finding algorithm is Newton-Raphson (Izmailov & Solodov, 2014), which computes an update $p$ by solving the following linear system of equations:

$\nabla^2 L(\theta)\, p = -\nabla L(\theta).$   (2)

Though this algorithm enjoys rapid local convergence (Boyd & Vandenberghe, 2004), it has no global convergence guarantees on non-smooth functions, and the radius of local convergence can be zero if the Hessian is singular at the solution (Griewank & Osborne, 1983). Indeed, early work on finding critical points of neural network loss surfaces found that Newton-Raphson (with a fixed, non-unit step size) often failed to converge (Coetzee & Stonick, 1997). Newton-Raphson is therefore typically augmented with additional machinery to guarantee global convergence on a wider class of functions. We consider two options in this paper.
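A bare-bones sketch of the Newton-Raphson iteration of Eq. 2 applied to the gradient field, using dense solves, unit steps, and no globalization (illustrative names, not the implementation used in this paper):

```python
import numpy as np

def newton_raphson(grad_fn, hess_fn, theta0, max_iter=100, eps=1e-10):
    """Find a zero of the gradient field by repeatedly solving Eq. 2."""
    theta = theta0
    for _ in range(max_iter):
        g = grad_fn(theta)
        if np.dot(g, g) <= eps:
            break
        p = np.linalg.solve(hess_fn(theta), -g)  # fails if the Hessian is singular
        theta = theta + p                        # unit step, no globalization
    return theta
```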

The first, which we call Newton-TR, follows (Dauphin et al., 2014) and uses a Levenberg (Levenberg, 1944) damping scheme, equivalent to a trust region approach. Instead of solving Eq. 2, the modified equation

$\left(\nabla^2 L(\theta) + \lambda I\right) p = -\nabla L(\theta),$   (3)

where $I$ denotes the identity matrix, is solved for multiple fixed values of the damping parameter $\lambda$. In an optimization context, the update that results in the lowest value of $L$ is used. In the context of root-finding, we instead take the update that results in the lowest value of $\|\nabla L\|^2$.

The second is based on the recently proposed Newton-MR method (Roosta et al., 2018), for "minimum residual". As above, Eq. 2 is solved for $p$, which is then used as a line search direction with a novel set of conditions, derived by applying the Wolfe conditions to the squared gradient norm (see (Roosta et al., 2018) for details). This method was proposed as an optimizer for functions with the property that the gradient is only zero at minimizers, and it was proven to converge to points with low gradient norm. This makes it an appealing candidate for root-finding, unlike other Newton methods that are designed for more restricted classes of functions.
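To make the contrast concrete, the following sketch implements a single Newton-TR-style step as described above: Eq. 3 is solved for a small set of fixed damping values, and the candidate update with the smallest squared gradient norm is kept. It uses dense linear solves rather than MR-QLP, and the names are illustrative:

```python
import numpy as np

def newton_tr_step(grad_fn, hess_fn, theta, lambdas=np.logspace(0, -4, 5), step=0.1):
    """One damped-Newton (Levenberg) step for root-finding on the gradient.

    Solves (H + lambda * I) p = -g for each fixed lambda and keeps the
    candidate update with the smallest squared gradient norm.
    """
    g = grad_fn(theta)
    H = hess_fn(theta)
    I = np.eye(len(theta))
    best_theta, best_sq_norm = theta, float(np.dot(g, g))
    for lam in lambdas:
        p = np.linalg.solve(H + lam * I, -g)
        candidate = theta + step * p
        g_cand = grad_fn(candidate)
        sq_norm = float(np.dot(g_cand, g_cand))
        if sq_norm < best_sq_norm:
            best_theta, best_sq_norm = candidate, sq_norm
    return best_theta
```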

2.2.3 Other Methods

Here we review other methods for finding critical points of neural network loss surfaces that provide promising targets for future research.

Eigenvector-following methods (Trygubenko & Wales, 2004) are commonly used to find critical points of a desired index in chemical physics. This is accomplished by initializing a quasi-Newton method, such as L-BFGS (Liu & Nocedal, 1989), at a local minimum and reversing the sign of the updates of that algorithm in a fixed set of directions at every step, corresponding to the desired index of the saddle being searched for. These methods are primarily used to find low-index saddles, rather than all critical points, and require prior knowledge of a local minimum. Eigenvector-following methods were applied to neural networks in (Ballard et al., 2017) and (Mehta et al., 2018a).

Homotopy methods comprise another class of approaches for root-finding, most prominently in numerical algebraic geometry (Sommese et al., 2005), where the polynomial form of the nonlinearity can be exploited. Homotopy methods use continuous transformations to deform solutions of a problem whose answers are known by construction into the solutions of a problem of interest (Davidenko, 1953; Broyden, 1969). They were first applied to neural network loss surfaces in (Coetzee & Stonick, 1997) and again more recently to linear network losses by (Mehta et al., 2018b). The latter took advantage of the polynomial structure of squared losses applied to linear networks to use advanced methods from numerical algebraic geometry (Bates et al., 2013). Outside of the case of polynomials, convergence guarantees are harder to come by.

2.3 Sampling Multiple Critical Points

Given an algorithm that can find a single critical point, the next problem is to define a method for initializing this algorithm repeatedly in such a way that the outputs form a representative sample of the true critical points. This presumes that the goal of examining the loss surface is to determine its mathematical properties, rather than the properties of, e.g., the parts of the loss surface which typical optimizers encounter from typical initializations. Two heuristics have been used in previous work. In both methods, the iterates of an optimization algorithm applied to the loss surface are used to propose points. In the first, from (Dauphin et al., 2014), these points are sampled uniformly from the iterates. We term this method "uniform iteration". In the second, from (Pennington & Bahri, 2017), iterates are sampled uniformly according to their height on the loss surface. We term this method "uniform height". As the identities of the critical points sampled by such a method are determined by the combination of loss surface shape, optimization algorithm behavior, and critical point-finding algorithm behavior, there is no guarantee of even sampling. In an attempt to reduce any possible sampling bias, (Dauphin et al., 2014) perturbed sampled points with additive noise. We compare the sampling bias of both initialization methods, in their noisy and noiseless versions, below (see Figs. 3 & 4).
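One plausible reading of the two heuristics, with hypothetical helper names, given a stored optimization trajectory and its per-iterate losses:

```python
import numpy as np

def sample_uniform_iteration(trajectory, n_samples, seed=0):
    """'Uniform iteration': pick iterates uniformly at random from the trajectory."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(trajectory), size=n_samples, replace=False)
    return [trajectory[i] for i in idx]

def sample_uniform_height(trajectory, losses, n_samples, seed=0):
    """'Uniform height': draw target loss values uniformly between the lowest and
    highest loss along the trajectory, then take the first iterate whose loss is
    at or below each target (losses are assumed to be non-increasing)."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses)
    targets = rng.uniform(losses.min(), losses.max(), size=n_samples)
    picks = [int(np.argmax(losses <= t)) for t in targets]
    return [trajectory[i] for i in picks]
```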

Figure 1: Newton-MR, Newton-TR, and GNM can recover critical points of a deep linear autoencoder. Top panels show squared gradient norms across epochs (both in log scale). Black lines correspond to runs that terminated below a criterion value, while orange lines correspond to runs that terminated above it. Due to the greater density of floating point numbers around zero, trajectories converging to the critical point at zero can achieve much lower squared gradient norms (as low as 1e-253); the y-axis is cut off at 1e-40 to focus on the other critical points. Bottom panels show the loss and index for critical points found using numerical algorithms (in green) overlaid on the true values (in gray). Each method was executed a total of 150 times: 10 optimization trajectories (each with a different random initialization) were used as seeds, with 15 initial points for each critical point-finding algorithm chosen at random from each trajectory.

3 The Deep Linear Autoencoder: A Model with Ground Truth

Linear deep networks exhibit many of the training (Saxe et al., 2013) and generalization (Advani & Saxe, 2017) complexities of their nonlinear counterparts while simultaneously being more amenable to analysis. The loss surface of a feedforward single-hidden layer linear network with squared error loss was first studied in (Baldi & Hornik, 1989), where the authors demonstrated that there are no non-global minima. Recent work in linear networks has extended that result to multi-hidden layer networks (Zhou & Liang, 2017) and to arbitrary differentiable convex losses (Laurent & von Brecht, 2017). Unlike in nonlinear networks, the identity of every critical point is known in the single-hidden layer, linear case. This provides a ground truth against which numerical results can be compared. Here we study the performance of critical point finding algorithms on a single-hidden layer deep linear autoencoder (DLAE) with sixteen input and output units and four hidden units, applied to Gaussian data with a diagonal covariance matrix and evenly spaced eigenvalues between 1 and 16. What follows is a description of the set of critical points of the loss surface of this network, after (Baldi & Hornik, 1989).

Each critical point of a DLAE with $n$ input units and $k$ hidden units ($k < n$) corresponds to a projection of the data onto a space spanned by some combination of at most $k$ of the eigenvectors of the data covariance matrix. In the network parameterization, this projection matrix (of rank at most $k$) is factored into the input and output weight matrices. This factorization is only unique up to an invertible linear transformation, and so each critical point is part of a disconnected Lie group of critical points (a situation partially shared by ReLU networks (Freeman & Bruna, 2016)). In the following, we only consider critical points modulo this equivalence relation, up to which the total number of critical points is given by the sum of the numbers of ways to choose from 0 to $k$ elements from an $n$-element set: $\sum_{i=0}^{k} \binom{n}{i}$. The loss surface of the DLAE we consider ($n = 16$, $k = 4$) therefore contains a single minimum and 2,516 saddles. The minimum corresponds to a solution where the network has learned to project onto the four eigenvectors with the largest eigenvalues, and the saddles correspond to all other projections. Note that the single critical point that corresponds to learning none of the eigenvectors, resulting in a parameter vector of all zeros, will be referred to as the "critical point at zero" later on.
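As a quick sanity check of this count for the network studied here ($n = 16$, $k = 4$):

```python
from math import comb

n, k = 16, 4
n_critical_points = sum(comb(n, i) for i in range(k + 1))  # 1 + 16 + 120 + 560 + 1820
print(n_critical_points)      # 2517 critical points, up to equivalence
print(n_critical_points - 1)  # 2516 saddles alongside the single global minimum
```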

It is possible to construct the weight matrices for the DLAE so as to initialize it to a particular critical point: the input-to-hidden weight matrix is constructed by placing the eigenvectors represented at that critical point in its columns, while the hidden-to-output weight matrix is its transpose. This allows us to compute every element of $\mathcal{C}$, up to equivalence, and then compute the true values of, e.g., the loss and index. These values can then be compared to those computed from a subset of $\mathcal{C}_\epsilon$, i.e. critical points obtained via numerical methods.
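A sketch of this construction, assuming the diagonal data covariance used here (so the eigenvectors are standard basis vectors) and a row-vector convention in which the network computes $x \mapsto x\,W_{\mathrm{in}} W_{\mathrm{out}}$; the function name is illustrative:

```python
import numpy as np

def weights_at_critical_point(eig_idx, n=16, k=4):
    """Weight matrices for the critical point projecting onto the eigenvectors
    with indices eig_idx (at most k of them).

    With a diagonal data covariance, the eigenvectors are standard basis vectors,
    so the columns of the input weight matrix are the corresponding one-hot vectors.
    """
    W_in = np.zeros((n, k))
    for col, i in enumerate(eig_idx):
        W_in[i, col] = 1.0
    W_out = W_in.T  # hidden-to-output weights: transpose of the input weights
    return W_in, W_out

# Example: the global minimum projects onto the four leading eigenvectors
# (indices 12-15 when eigenvalues 1..16 are sorted in increasing order).
W_in, W_out = weights_at_critical_point([12, 13, 14, 15])
P = W_in @ W_out  # the rank-4 projection implemented by the network
```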

4 Results

To find critical points, we first computed optimization trajectories by training a linear network on Gaussian data with the squared error loss. These optimization trajectories were then used to generate initial points for critical point-finding algorithms. Unless otherwise stated, initial points were selected uniformly with respect to height on the loss surface. Critical point-finding algorithms were terminated either when no proposed step size met acceptance criteria or when a maximum number of epochs was reached. Returned points were accepted as numerical critical points if their squared gradient norm was less than 1e-10.

To optimize the gradient norm objective (Eq. 1), we use batch gradient descent with back-tracking line search using the Wolfe conditions (Wolfe, 1971). We use fast Hessian-vector products (Pearlmutter, 1994) to compute our updates with the AutoGrad (Maclaurin, 2016) Python package.
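For reference, a standard double-backward construction of the Hessian-vector product with autograd looks like the following; this is a generic sketch, not necessarily the exact implementation used here:

```python
import autograd.numpy as np
from autograd import grad

def make_hvp(loss_fn):
    """Return a function (theta, v) -> H(theta) @ v without forming the Hessian.

    The gradient of <grad L(theta), v> with respect to theta is exactly the
    Hessian-vector product, so a second backward pass suffices.
    """
    grad_L = grad(loss_fn)
    def hvp(theta, v):
        return grad(lambda th: np.dot(grad_L(th), v))(theta)
    return hvp
```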

For both Newton methods, we use the robust linear solver MR-QLP (Choi et al., 2011). See (Roosta et al., 2018) for a succinct explanation of the benefits of this solver for poorly-conditioned, indefinite problems. This solver is also accelerated by the use of fast Hessian-vector products.

See Supplementary Materials for further details and hyperparameter values.

4.1 Newton Methods Outperform Gradient Norm Minimization

$\epsilon$-CPs found by all three numerical methods can have the same qualitative and quantitative loss and index values as do true CPs, but the methods have varying efficiencies (Fig. 1). With the selected convergence criteria (see caption), none of the methods find $\epsilon$-CPs in regions of the loss-index plane where there are no true CPs (e.g. bad local minima, in the top left quadrant, or low-lying saddles, in the bottom right quadrant). All methods also find subsets of $\mathcal{C}_\epsilon$ that span the same values of loss and index as does $\mathcal{C}$.

Newton-MR (Fig. 1, left column) terminates in fewer iterations than does Newton-TR (Fig. 1, middle column); median iteration counts were 221 and 430, respectively (Mann-Whitney U = 4768.5). Newton-TR furthermore requires multiple calls to MR-QLP per iteration (one for each choice of trust region size), and so Newton-MR terminates more quickly (26.5 s per 100 iterations for Newton-MR vs. 1 min 39 s for Newton-TR, on commodity hardware). A more sophisticated mechanism for determining trust region size, rather than simply selecting the best choice from a pre-defined set, might narrow this performance gap.

Gradient norm minimization (GNM; Fig. 1, right column) is less efficient in two ways. First, successful runs of GNM take on the order of one hundred times as many epochs to reach the same value of the gradient norm as do the Newton methods. Each iteration only requires a single Hessian-vector multiply, unlike the Newton methods, whose calls to MR-QLP require several Hessian-vector multiplies, but the difference in wall time is still more than an order of magnitude (6.8 s per 100 iterations for GNM). Second, as discussed above in Section 2, GNM tends to get stuck in local minima of its objective function, as evidenced by the numerous short orange traces that terminate without the gradient norm going below 1e-10 (62.7% of runs). Because of this, even though GNM was given the same number of runs and more compute, the number of $\epsilon$-CPs recovered is far smaller (cf. Fig. 1, bottom-right and bottom-left panels). Note that these local minima of the gradient norm objective need not correspond to local minima, or even critical points, of the original loss. We therefore conclude that Newton methods in general, and Newton-MR in particular, are a better choice of critical point-finding algorithm for neural network loss surfaces.

4.2 Strict Cutoffs are Necessary to Accurately Recover Critical Points

Figure 2: Cutoffs above 1e-10 are insufficient to guarantee accurate loss and index recovery. As in Fig. 1, $\epsilon$-CPs are plotted in green over true CPs in gray. For each panel, points are selected by taking the 150 runs of Newton-MR in Fig. 1 and taking the first point whose squared gradient norm is below the cutoff value $\epsilon$ shown in the top-left corner of the panel. For the bottom-right panel, we choose $\epsilon = \infty$, which corresponds to accepting the initial point as an $\epsilon$-CP.

Here and in previous work, the final output of a critical point-finding algorithm was accepted if the squared gradient norm on termination was below some cutoff value (here, 1e-10). However, at termination, many points may be far below this cutoff. In fact, the vast majority of runs terminate with squared gradient norm close to 1e-30, and so the results in Fig. 1 do not demonstrate that simply having squared gradient norm at the cutoff 1e-10 is sufficient to guarantee a match between the loss and index values of elements of and of . If a sufficient cutoff could be identified, then much iteration time could be saved by terminating runs as soon as the squared gradient norm went below that value.

We found that, while having squared gradient norm equal to our cutoff (1e-10) was sufficient to guarantee accurate recovery of loss and index values, the looser cutoff values used in (Dauphin et al., 2014) and (Pennington & Bahri, 2017) were not (Fig. 2). The error is larger for lower values of the loss. Note, however, that the overall shape of the loss-index relationship is largely preserved for these looser cutoffs. Interestingly, we find that simply computing the loss and index of points along the optimization trajectory results in the same concave-up shape reported in previous work (Dauphin et al., 2014; Pennington & Bahri, 2017) (Fig. 2, bottom-right panel), underscoring the need for care in the selection of convergence criteria.

4.3 Adding Noise Reduces Sampling Bias

Figure 3: Additive noise with the correct magnitude reduces sampling bias. Top left panel: number of times an $\epsilon$-CP that performed projection onto a given eigenvector was found. Colors differentiate runs seeded from different optimization trajectories. Eigenvectors are numbered in order of increasing eigenvalue. Bars outlined in black indicate the number of times the critical point at zero was found. Top right panel: entropy of the distribution of eigenvector IDs (as in the top left panel) for $\epsilon$-CPs found by adding Gaussian noise with variance $\sigma^2$ to points sampled from optimization trajectories prior to executing a critical point-finding algorithm (here, Newton-MR). Error bars indicate standard deviation, computed from bootstrapped samples. Bottom row: loss and index values recovered when adding noise as in the top right panel, along with log-scaled histograms of index values. The values found by executing the same algorithm from the same trajectory without noise are indicated with empty black circles.

As discussed above, neither sampling uniformly from the trajectory nor sampling uniformly in height on the loss surface guarantees an unbiased sample of the set $\mathcal{C}$. Indeed, the sample of critical points found by seeding a Newton method from a single optimization trajectory is heavily biased (Fig. 3, top left panel). First, samples from all trajectories evinced a bias towards critical points that perform projection onto eigenvectors with large eigenvalues. Second, individual trajectories are similarly biased towards a few other, seemingly random eigenvectors (e.g. 10 and 12 for the trajectory in green; 9 for the trajectory in orange). We quantified this bias by taking the distribution of targeted eigenvectors (plotted as histograms in Fig. 3, top left panel) and computing its entropy (Fig. 3, top right panel).

To reduce this sampling bias, (Dauphin et al., 2014) proposed adding noise to the values sampled from the trajectory. We investigated whether this approach worked with Gaussian noise. All results presented in this section were obtained using Newton-MR; results for Newton-TR were qualitatively similar. We found that bias was partially reduced for appropriate choices of noise variance (Fig. 3, top right panel and bottom row). For noise with low variance (Fig. 3, bottom-left-most panel), there was no substantial change in either the entropy or the identity of the recovered critical points. For noise with high variance (Fig. 3, bottom-right-most panel), adding noise did not increase the entropy and amplified the bias towards critical points projecting onto eigenvectors with large eigenvalues. This result is somewhat curious, since the loss values of the noise-corrupted points are higher, not lower, than the originals; it indicates that Newton methods do not always converge to points nearby in loss value. For intermediate values of the noise variance (Fig. 3, bottom-center panel), adding noise reduced the bias, as quantified by the entropy (bootstrapped means of 3.05 bits without noise and 3.34 bits with noise, with the difference assessed by Student's t-test), and resulted in a closer match between the distributions of the indices of the true and computed CPs (see histograms in Fig. 3, bottom row). This intermediate noise variance corresponded to a signal-to-noise ratio of 2.8 dB, compared to SNRs of 6.8 dB and 0.8 dB in the low- and high-variance cases. This suggests that, to optimally reduce bias, added noise must be large enough to perturb the parameters appreciably but not so large as to overwhelm them.
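For clarity, the two summary quantities used in this section, the entropy of the eigenvector-ID distribution and the signal-to-noise ratio in dB of the noise-perturbed parameters, can be computed as in the sketch below (hypothetical helper names; the bootstrap procedure is not reproduced):

```python
import numpy as np

def eigvec_id_entropy(eig_ids, n_eigvecs=16):
    """Entropy, in bits, of the distribution of eigenvector IDs hit by the found CPs."""
    counts = np.bincount(eig_ids, minlength=n_eigvecs).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def snr_db(theta, noise_var):
    """Signal-to-noise ratio, in dB, of parameters perturbed by N(0, noise_var) noise."""
    signal_power = float(np.mean(theta ** 2))
    return 10.0 * np.log10(signal_power / noise_var)
```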

4.4 Sampling Bias is Worse when Initializing Uniformly Across Iterations

Figure 4: Uniformly sampling across iterates increases the sampling bias of critical point-finding methods. Top left panel: as in Fig. 3, but for critical point-finding algorithms initialized from uniformly chosen iterates of the seed optimization trajectory. Top right panel: as in Fig. 3, comparing entropy across noiseless and noise-added versions of both forms of initial point sampling. Bottom row: as in Fig. 3.

Previous work sampled initial points for critical point-finding algorithms from optimization trajectories either uniformly across iterations (Dauphin et al., 2014) or uniformly in height on the loss surface (Pennington & Bahri, 2017). Note that all results reported above used the latter method. We compared these two methods and found that sampling uniformly across iterations resulted in a heavy bias towards critical points projecting onto dominant eigenvectors (Fig. 4, top left and bottom left panels). We quantified this with the entropy of the distribution of eigenvectors onto which critical points projected, as above, and found that the entropy was significantly lower when sampling uniformly across iterations (Fig. 4, top right panel). Adding Gaussian noise to the sampled values had no effect for small noise variances and did not fully counteract the reduction in entropy for intermediate variances (Fig. 4, bottom row): in both the noiseless and intermediate-noise conditions, the mean entropy for uniform-height sampling remained higher than for uniform-iteration sampling.

5 Discussion

We examined the performance of three methods to numerically find the critical points of neural network loss surfaces on a test problem for which the ground truth critical points are known. In the absence of this ground truth and in the presence of numerical and convergence concerns, the fidelity of reported critical points to true critical points is unknown. We found that, while all three methods are capable of finding the critical points of a deep linear autoencoder, the Newton method-based algorithms outperform direct gradient norm minimization (GNM). As predicted by theory, GNM frequently (on 62.7% of runs) gets stuck in local minima and requires two orders of magnitude more iterations to converge, likely due to extremely poor conditioning.

We identified a recently-proposed Newton method, Newton-MR, as a promising candidate algorithm and found that it produced solutions with fewer iterations and in less wall time than the trust region Newton method used in (Dauphin et al., 2014). The possible applicability of quasi-Newton methods (Liu & Nocedal, 1989) for large problems remains unexplored and is left to future investigation. Newton-based homotopy methods (Mehta et al., 2015) are another promising future direction, based on preliminary results in (Coetzee & Stonick, 1997).

Numerically-recovered critical points rarely have gradient norm exactly zero, and the appropriate cutoff value for faithfully representing the loss and index values of the true critical points of a neural network is unknown. Our results suggest that for precise recovery of loss and index values, this cutoff in the squared norm should be stricter (1e-10) than previously reported (1e-06 in (Pennington & Bahri, 2017), unreported in (Dauphin et al., 2014)), presuming that the relevant Lipschitz constants for nonlinear networks are at least as large as those for linear networks.

Given the infeasibility of calculating all of the critical points of the loss surface, the goal of critical point-finding methods should be to produce an unbiased picture of the true critical points. We find that previously reported sampling methods based on optimization trajectories are biased towards critical points at low values of the loss and towards a random other subset of critical points, possibly determined by which critical points were closest to the optimization trajectory. Adding noise was insufficient to remove this bias, but did mitigate it slightly. Interestingly, we found that adding larger amounts of noise actually increased the bias towards critical points at low values of the loss. Further work is needed to explain this phenomenon.

Taken together, our findings suggest that numerically recovering the critical points of nonlinear neural networks with high accuracy is possible, even if the problem of obtaining an even sample is still unsolved. For questions regarding just the parts of the loss surface that a typical training trajectory passes through, relevant for answering questions about optimization, the latter is not such a serious problem. However, it remains problematic for questions about the mathematical properties of the ensemble of critical points. Answering these questions for a wide variety of neural networks can provide insight into which properties of the loss surface are most relevant for neural network training and generalization and by which mechanisms they arise.

Acknowledgements

The authors would like to thank Jesse Livezey, Shariq Mobin, Jascha Sohl-Dickstein, Max Simchowitz, Levent Sagun, Yasaman Bahri, and Jeffrey Pennington for useful discussions. Author CF was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. Authors CF & KB were funded by a DOE/LBNL LDRD, ‘Deep Learning for Science’, (PI, Prabhat). NW was supported by the Google PhD Fellowship. MRD was supported in part by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract W911NF-13-1-0390.


6 Supplementary Material

6.1 Detailed Methods

6.2 Data

The input data to the network was a 10,000-element sample from a 16-dimensional Gaussian with mean zero and a diagonal covariance matrix with entries 1 through 16. Because the analytical results (Baldi & Hornik, 1989) were derived for centered data, the data was then zero-centered to floating-point precision by subtracting the sample mean.
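A sketch of this data-generation step (the seed and random generator are arbitrary choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n_samples, n_dim = 10_000, 16
cov_diag = np.arange(1, n_dim + 1, dtype=float)              # covariance entries 1..16
X = rng.normal(size=(n_samples, n_dim)) * np.sqrt(cov_diag)  # scale each coordinate
X = X - X.mean(axis=0, keepdims=True)                        # zero-center to floating precision
```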

6.3 Network and Training

The linear autoencoder network had 16 input units, 4 hidden units, and 16 output units. Initial parameter values were sampled from a Gaussian with variance equal to the inverse of the number of weights in each weight matrix. The network was then trained for 10,000 epochs of full-batch gradient descent with a fixed learning rate of 0.01. Final parameter values had losses within 5e-6 of the loss of the global minimizer (starting from losses above it by order 1).
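A minimal autograd sketch of this training setup follows. The parameter flattening, loss scaling, and helper names are our own choices; the architecture, initialization scale, learning rate, and epoch count follow the description above, and `X` is the centered data matrix from the previous sketch:

```python
import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad

n_in, n_hid = 16, 4
n_weights = n_in * n_hid  # 64 weights in each weight matrix

def unpack(theta):
    W_in = theta[:n_weights].reshape(n_in, n_hid)   # input-to-hidden, 16 x 4
    W_out = theta[n_weights:].reshape(n_hid, n_in)  # hidden-to-output, 4 x 16
    return W_in, W_out

def loss(theta, X):
    W_in, W_out = unpack(theta)
    recon = np.dot(np.dot(X, W_in), W_out)
    return 0.5 * np.mean(np.sum((recon - X) ** 2, axis=1))

npr.seed(0)
theta = npr.randn(2 * n_weights) * np.sqrt(1.0 / n_weights)  # variance 1 / (#weights per matrix)

grad_loss = grad(loss)
for _ in range(10_000):                     # full-batch gradient descent
    theta = theta - 0.01 * grad_loss(theta, X)
```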

6.4 Back-Tracking Line Search

The step sizes for minimizing the gradient norm objective (Equation 1) and along the Newton-MR search direction were computed using back-tracking line search. The initial step size was 0.1, and the step size was multiplied by 0.5 whenever a proposed update failed to meet the update criteria (the Wolfe conditions for GNM (Wolfe, 1971), the criteria from (Roosta et al., 2018) for Newton-MR). In both cases, the free parameter for the Armijo-type condition was set to 1e-04; for GNM, the free parameter for the curvature-type condition was set to 0.9. After a step was accepted, the step size was multiplied by 2 and used as the initial step size for the next step. Line search was terminated when multiplying the proposed step size by 0.5 resulted in no change in its value, given the floating-point type in use. All of our experiments used double-precision floats.
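A sketch of the back-tracking loop with only the Armijo-type sufficient-decrease check (the curvature-type condition and the Newton-MR-specific criteria are omitted; names are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, theta, direction, step0=0.1, c1=1e-4):
    """Halve the step until an Armijo-type sufficient-decrease condition holds.

    Returns None when halving no longer changes the step size (the termination
    rule described above); the curvature-type check is omitted for brevity."""
    f0 = f(theta)
    slope = float(np.dot(grad_f(theta), direction))
    step = step0
    while f(theta + step * direction) > f0 + c1 * step * slope:
        new_step = 0.5 * step
        if new_step == step:
            return None
        step = new_step
    return step
```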

6.5 Newton-TR

For the Newton-TR update (Equation 3), we used 5 evenly log-spaced values of $\lambda$ from 1 to 1e-04. We found that smaller ranges reduced convergence. We used a step size of 0.1.

6.6 Calculating Index and Identifying Critical Points

To calculate the index numerically, a tolerance must be set for how negative an eigenvalue must be to count as negative. We chose 1e-05, the square root of our criterion for the squared gradient norm.

Critical points were identified as follows. First the un-factorized linear map that the network applies was calculated by multiplying together the input and output weights. Near a critical point, this map should be strongly diagonally dominant, thanks to the diagonal structure of the data’s true covariance matrix. We found that simply identifying which diagonal elements were above 0.9 was sufficient to determine critical point identity: doing so did not result in identifying putative critical points that performed projection onto more than 4 eigenvectors.
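A sketch of this identification step, assuming the row-vector weight convention of the training sketch above, with `W_in` of shape 16x4 and `W_out` of shape 4x16:

```python
import numpy as np

def identify_critical_point(W_in, W_out, diag_thresh=0.9):
    """Indices of the eigenvectors onto which the network projects.

    Near a critical point, the end-to-end linear map W_in @ W_out is close to a
    projection onto a subset of standard basis vectors (the data covariance is
    diagonal), so large diagonal entries identify the eigenvectors involved."""
    P = W_in @ W_out  # 16 x 16 un-factorized linear map
    return np.flatnonzero(np.diag(P) > diag_thresh)
```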