One of the biggest mysteries of deep learning is why neural networks are successfully trained in practice using gradient-based methods, despite the inherent non-convexity of the associated optimization problem. For example, non-convex problems can have poor local minima, which will cause any local search method (and in particular, gradient-based ones) to fail. Thus, it is natural to ask what types of assumptions, in the context of training neural networks, might mitigate such problems. For example, recent work has shown that other non-convex learning problems, such as phase retrieval, matrix completion, dictionary learning, and tensor decomposition, do not have spurious local minima under suitable assumptions, in which case local search methods have a chance of succeeding (e.g.,(Ge et al., 2015; Sun et al., 2015; Ge et al., 2016; Bhojanapalli et al., 2016)). Is it possible to prove similar positive results for neural networks?
In this paper, we focus on perhaps the simplest non-trivial ReLU neural networks, namely predictors of the form
for some , where is the ReLU function, is a vector in , and are parameter vectors. We consider directly optimizing the expected squared loss, where the input is standard Gaussian, and in the realizable case – namely, that the target values are generated by a network of a similar architecture:
Note that here, the choice (for all and any permutation ) is a global minimum with zero expected loss. Several recent papers analyzed such objectives, in the hope of showing that it does not suffer from spurious local minima (see related work below for more details).
Our main contribution is to prove that unfortunately, this conjecture is false, and that Eq. (1) indeed has spurious local minima once . Moreover, this is true even if the dimension is unrestricted, and even if we assume that
are orthonormal vectors. In fact, since in high dimensions randomly-chosen vectors are approximately orthogonal, and the landscape of the objective function is robust to small perturbations, we can show that spurious local minima exist fornearly all neural network problems as in Eq. (1
), in high enough dimension (with respect to, say, a Gaussian distribution over). Moreover, we show experimentally that these local minima are not pathological, and that standard gradient descent can easily get trapped in them, with a probability which seems to increase towards with the network size.
Our proof technique is a bit unorthodox. Although it is possible to write down the gradient of Eq. (1) in closed form (without the expectation), it is not clear how to get analytical expressions for its roots, and hence characterize the stationary points of Eq. (1). As far as we know, an analytical expression for the roots might not even exist. Instead, we employed the following strategy: We ran standard gradient descent with random initialization on the objective function, until we reached a point which is both suboptimal (function value being significantly higher than ); approximate stationary (gradient norm very close to
); and with a strictly positive definite Hessian (with minimal eigenvalue significantly larger than
). We use a computer to verify these conditions in a formal manner, avoiding floating-point arithmetic and the possibility of rounding errors. Relying on these numbers, we employ a Taylor expansion argument, to show that we must have arrived at a point very close to a local (non-global) minimum of Eq. (1), hence establishing the existence of such minima.
On the more positive side, we show that an additional over-parameterization assumption appears to be very effective in mitigating these local minima issues: Namely, we use a network larger than that needed with unbounded computational power, and replace Eq. (1) with
where . In our experiments with up to size , we observe that whereas leads to plenty of local minima, leads to much fewer local minima, whereas no local minima were encountered once (although those might still exist for larger values of than those we tried). Thus, although Eq. (1) has local minima, we conjecture that Eq. (2) might still be proven to have no bad local minima, but this would necessarily require to be sufficiently larger than .
The paper is structured as follows: After surveying related work below, we provide our main results and proof ideas in Sec. 2. Sec. 3 provide additional experimental details about the local minima found, as well empirical evidence about the likelihood of reaching them using gradient descent. Detailed proofs are in Sec. 4.
1.1 Related Work
There is a large and rapidly increasing literature on the optimization theory of neural networks, surveying all of which is well outside our scope. Thus, in this subsection, we only briefly survey the works most relevant to ours.
We begin by noting that when minimizing the average loss over some arbitrary finite dataset, it is easy to construct problems where even for a single neuron (in Eq. (1)), there are many spurious local minima (e.g., (Auer et al., 1996; Swirszcz et al., 2016)). Moreover, the probability of starting at a basin of such local minima is exponentially high in the dimension (Safran and Shamir, 2016). On the other hand, it is known that if the network is over-parameterized, and large enough compared to the data size, then there are no local minima (Poston et al., 1991; Livni et al., 2014; Haeffele and Vidal, 2015; Zhang et al., 2016; Soudry and Carmon, 2016; Soltanolkotabi et al., 2017; Nguyen and Hein, 2017; Boob and Lan, 2017). In any case, neither these positive nor negative results apply here, as we are interested in the expected (population) loss with respect to the Gaussian distribution, which is of course non-discrete. Also, several recent works have studied learning neural networks under a Gaussian distribution assumption (e.g., Janzamin et al. (2015); Brutzkus and Globerson (2017); Du et al. (2017); Li and Yuan (2017); Feizi et al. (2017); Zhang et al. (2017); Ge et al. (2017)), but using a network architecture different than ours, or focusing on algorithms rather than the geometry of the optimization problem. Finally, Shamir (2016) provide hardness results for training neural networks even under distributional assumptions, but these do not apply when making strong assumptions on both the input distribution and the network generating the data, as we do here.
For Eq. (1), a few works have shown that there are no spurious local minima, or that gradient descent will succeed in reaching a global minimum, provided the vectors are in general position or orthogonal (Zhong et al., 2017; Soltanolkotabi et al., 2017; Tian, 2017). However, these results either apply only to , assume the algorithm is initialized close to a global optimum, or analyze the geometry of the problem only on some restricted subset of the parameter space.
The empirical observation that gradient-based methods may not work well on Eq. (1) has been made in Livni et al. (2014), and more recently in Ge et al. (2017). Moreover, Livni et al. (2014) empirically observed that over-parameterization seems to help. However, our focus here is to prove the existence of such local minima, as well as more precisely quantify their behavior as a function of the network sizes.
2 Main Result and Proof Technique
Before we begin, a small note on terminology: When referring to local minima of a function on Euclidean space, we always mean spurious local minima (i.e., points such that for all in some open neighborhood of ).
Our basic result is the following:
Consider the optimization problem
where are orthogonal unit vectors in . Then for , as well as , the objective function above has spurious local minima.
For smaller than , we were unable to find local minima using our proof technique, since gradient descent always seemed to converge to a global minimum. Also, although we have verified the theorem only up to , the result strongly suggests that there are local minima for larger values as well. See Sec. 3 for some examples of the local minima found.
The theorem assumes a fixed input dimension, and a particular choice of . However, these assumptions are not necessary and can be relaxed, as demonstrated by the following corollary:
Thm. 1 also applies if the space is replaced by for any (with distributed as a standard Gaussian in that space). Moreover, if are chosen i.i.d. from a Gaussian distribution (for any ), the theorem still holds with probability at least .
The corollary is not specific to a Gaussian distribution over , and can be generalized to any distribution for which are approximately orthogonal and of the same norm in high dimensions (see below for details).
We now turn to explain how these results are derived, starting with Thm. 1. In what follows, we let be the vector of parameters, and let be the objective function defined in Thm. 1 (assuming are fixed). We will also assume that is thrice-differentiable in a neighborhood of (which will be shown to be true as part of our proofs), with a gradient and a Hessian .
Clearly, a global minimum of is obtained by for all (and otherwise), in which case attains a global minimum of . Thus, to prove Thm. 1, it is sufficient to find a point such that , , and . The major difficulty is showing the existence of points where the first condition is fulfilled: Gradient descent allows us to find points where , but it is very unlikely to return a point where exactly. Instead, we use a Taylor-expansion argument (detailed below), to show that if we found a point such that is sufficiently close to , as well as and , then must be close to a local minimum.
The second-order Taylor expansion of a multivariate, thrice-differentiable function about a point , in a direction given by a unit vector and using a Lagrange remainder term, is given by
for some , and where denotes the -th coordinate of . Denoting the remainder term as , we have
Now, suppose that the point we obtain by gradient descent satisfies , and (for some positive ), uniformly for all unit vectors . Fix some and let . By the Taylor expansion above, this implies that for any and all unit ,
An elementary calculation reveals that the term is strictly positive for any
(and in particular, in the closed interval of ). Letting , and assuming , we get that there is some small closed ball of radius centered at (and with boundary ), such that
Moreover, since is continuous, it is minimized over at some point . But then
so must reside in the interior of . Thus, it is minimal in an open neighborhood containing it, hence it is a local minimum. Overall, we have arrived at the following key lemma:
Assume that for some , it holds that and
let denote the smallest eigenvalue of , and let
If and then the function contains a local minimum, within a distance of at most from .
The only missing element is that the local minimum might be a global minimum. To rule this out, one can simply use the fact that is a Lipschitz function, so that if is much larger than , the neighboring local minimum can’t have a value of , and hence cannot be global:
Under the conditions of Lemma 1, if it also holds that
then the local minimum is non-global.
The formal proof of this lemma appears in Subsection 4.1.4.
Most of the technical proof of Thm. 1 consists in rigorously verifying the conditions of Lemma 1 and Lemma 2. A major hurdle is that floating-point calculations are not guaranteed to be accurate (due to the possibility of round-off and other errors), so for a formal proof, one needs to use software that comes with guaranteed numerical accuracy. In our work, we chose to use variable precision arithmetic (VPA), a standard package of MATLAB which is based on symbolic arithmetic, and allows performing elementary numerical computations with an arbitrary number of guaranteed digits of precision. The main technical issue we faced is that some calculations are not easily done with a few elementary arithmetical operations (in particular, the standard way to compute would be via a spectral decomposition of the Hessian matrix). The bulk of the proof consists of showing how we bound the quantities relevant to Lemma 1 in an elementary manner.
Finally, we turn to discuss how Corollary 1 is proven, given Thm. 1 (see Subsection 4.2 for a more formal derivation). The proof idea is that the objective does not have any “non-trivial” structure outside the span of . Therefore, if we take a local minima for
, and pad it withzeros, we get a point in for which the gradient’s norm is unchanged, the Hessian has the same spectrum for any , and the third derivatives are still bounded. Hence, that point is a local minimum in the higher-dimensional problem as well. As to the second part of the corollary, the only property of the Gaussian distribution we need is that in high dimensions, if we sample , then we are overwhelmingly likely to get approximately orthogonal vectors with approximately the same norm. Hence, up to rotation and scaling, we get a small perturbation of the objective considered in Thm. 1. Moreover, for large enough , we can make the perturbation arbitrarily small, uniformly in some compact domain. Now, recall that we prove the existence of some local minimum , by showing that in some small sphere enclosing . If the perturbations are small enough, we also have , which by arguments similar to before, imply that is close to a local minimum of .
So far, our technique proves the existence of local minima for the objective function in Eq. (2). However, this does not say anything about the likelihood of gradient descent to reach them. We now turn to study this question empirically.
For each value of , where and , we ran 1000 instantiations of gradient descent on the objective in Eq. (2), each starting from a different random initialization111We used standard Xavier initialization: Each weight vector was samples i.i.d. from a Gaussian distribution in , with zero mean and covariance .. Each instantiation was ran with a fixed step size of , until reaching a candidate stationary point / local minima (the stopping criterion was that the gradient norm w.r.t. any is at most ). Points obtaining objective values less than were ignored as those are likely to be close to a global minimum. Interestingly, no points with value between and were found. For all remaining candidate points, we verified that the conditions in Lemmas 1 and 2 are met222 Since running our algorithm for all suspicious points found on all architectures is time consuming, we instead identified points that are equivalent up to permutations on the order of neurons and of the data coordinates, since the objective is invariant under such permutations. By bounding the maximal Euclidean distance between these points and using the Lipschitzness of the objective and its Hessian (see Thm. 4 and Lemma 6), this allowed us to run the algorithm on a single representative from a family of equivalent points and speed up the running time drastically. Also, the objective was tested to be thrice-differentiable in all enclosing balls of radii returned by the algorithm. Specifically, we ensured that no two such balls intersect (which results in two identical neurons, where the objective is not thrice-differentiable) and that no ball contains the origin (which results in a neuron with weight , where again the objective is not thrice-differentiable). to conclude that these points are indeed close to spurious local minima (in all cases, the distance turned out to be less than ). Our verification process included verifying thrice-differentiability in the enclosing balls containing the minimum by asserting they contain no singular points, hence the objective is an analytical expression when restricted to these balls where differentiability follows.
In Tables 2 and 2, we summarize the percentage of instantiations which were verified to converge close to a spurious local minimum, as a function of . We note that among candidate points found, only a tiny fraction could not be verified to be local minima (this only occured for network sizes , and consist only of the instantiations respectively). In the tables, we also provide the minimal eigenvalue of the Hessian of the objective, and the objective value (or equivalently, the optimization error) at the points found, averaged over the instantiations333Since all points are extremely close to a local minimum, the objective at the minimum is essentially the same, up to a deviation on order less than . Also, the minimal eigenvalues vary by at most .. Note that since the minimal eigenvalue is strictly positive and varies slightly inside the enclosing ball, this indicates that these are in fact strict local minima. As the tables demonstrate, the probability of converging to a spurious local minimum increases rapidly with , and suggests that it eventually become overwhelming as long as . However, on a positive note, mild over-parameterization seems to remedy this, as no local minima were found for where , and local minima for are much more scarce than for . We leave the investigation of local minima for larger values of to future work.
|k||n||% of runs||Average||Average|
|k||n||% of runs||Average||Average|
In Fig. 1, we show the distribution of the objective values obtained in the points found, over the 1000 instantiations of several architectures. The figure clearly indicates that apart from a higher chance of converging to local minima, larger architectures also tend to have worse values attained on these minima.
Finally, in examples 1 and 2 below, we present some specific local minima found for and , and discuss their properties. We note that these are the smallest networks (with and respectively) for which we were able to find such points.
Out of 1000 gradient descent instantiations for , three converged close to a local minimum. All three were verified to be essentially identical (after permuting the neurons and up to an Euclidean distance of ), and have the following form:
where the parameter vector of each of the neurons corresponds to a column of . The Hessian of the objective at , , was confirmed to have minimal eigenvalue . This implied that all three suspicious points found for are of distance at most from a local minimum with objective value at least .
Out of 1000 gradient descent initializations for , one converged to a local minimum. The point found, denoted , is given below:
where the parameter vector of each of the neurons corresponds to a column of . The Hessian of the objective at , , was confirmed to have minimal eigenvalue . This implied that is of distance at most from a local minimum with objective value at least .
It is interesting to note that the points found in examples 1 and 2, as well as all other local minima detected, have a nice symmetric structure: We see that most of the trained neurons are very close to the target neurons in most of the dimensions. Also, many of the entries appear to be the same. Surprisingly, although such constructions might seem brittle, these are indeed strict local minima. Moreover, the probability of converging to such points becomes very large as the network size increases as demonstrated by our experiments.
In the proofs, we use bold-faced letters (e.g., ) to denote vectors, barred bold-faced letters (e.g., ) to denote vectors normalized to unit Euclidean norm, and capital letters to generally denote matrices. Given a natural number , we let be shorthand for . Given a matrix , denotes its spectral norm. We will also make use of the following version of Weyl’s inequality, stated below for completeness.
Theorem 2 (Weyl’s inequality).
Suppose are real symmetric matrices such that . Assume that have eigenvalues respectively, and that . Then
4.1 Proof of Thm. 1
To prove Thm. 1 for some , it is enough to consider some particular choice of orthogonal , since any other choice amounts to rotating or reflecting the same objective function (which of course does not change the existence or non-existence of its local minima). In particular, we chose these vectors to simply be the standard basis vectors in .
As we show in Subsection 4.1.1 below, the objective function in Eq. (2) can be written in an explicit form (without the expectation term), as well as its gradients and Hessians. We first ran standard gradient descent, starting from random initialization and using a fixed step size of , till we reached a point , such that the gradient norm w.r.t. any is at most . Given this point, we use Lemma 1 and Lemma 2 to prove that it is close to a local minimum. Specifically, we built code which does the following:
Provide a rigorous upper bound on the norm of the gradient at a given point (since we have a closed-form expression for the gradient, this only requires elementary calculations).
Provide a rigorous lower bound on the minimal eigenvalue of : This is the technically most demanding part, and the derivation of the algorithm is presented in Subsection 4.1.2.
Provide a rigorous upper bound on the remainder term (see Subsection 4.1.3 for the relevant calculations).
We used MATLAB (version 2017b) to perform all floating-point computations, and its associated MATLAB VPA package to perform the exact symbolic computations. The code we used is freely available at https://github.com/ItaySafran/OneLayerGDconvergence.git. For any candidate local minimum, the verification took from less than a minute up to a few hours, depending on the size of , when running on Intel Xeon E5 processors (ranging from E5-2430 to E5-2660).
4.1.1 Closed-form Expressions for and
For convenience, we will now state closed-form expressions (without an expectation) for the objective function in Eq. (2), its gradient and its Hessian. These are also the expressions used in the code we built to verify the conditions of Lemma 1 and Lemma 2. First, we have that
Using the above representation, Brutzkus and Globerson (2017) compute the gradient of with respect to , given by
Which implies that , the gradient of the objective with respect to , equals
To formally define the Hessian of (a matrix), we partition it into blocks, each of size . Define to equal on the -th diagonal block and zero elsewhere. For define to equal on the -th block and zero elsewhere. We now have that the Hessian is given by
4.1.2 Lower bound on
We wish to verify that the Hessian of a point returned by the gradient descent algorithm is positive definite, as well as provide a lower bound for its smallest eigenvalue, avoiding the possibility of errors due to floating-point computations.
Since the Hessians we encounter have relatively small entries and are well-conditioned, it turns out that computing the spectral decomposition in floating-point arithmetic provides a very good approximation of the true spectrum of the matrix. Therefore, instead of performing spectral decomposition symbolically from scratch, our algorithmic approach is to use the floating-point decomposition, and merely bound its error, using simple quantities which are easy to compute symbolically. Specifically, given the (floating-point, possibly approximate) decomposition of a matrix , we bound the error using the distance of from , as well as the distance of from its projection on the subspace of orthogonal matrices given by . Formally, we use the following algorithm (where numerical computations refer to operations in floating-point arithmetic):
Algorithm analysis: For the purpose of analyzing the algorithm, the following two lemmas will be used.
For any natural we have
Clearly, for any we have
Using the generalized binomial theorem, we have for any
Consider the -th coefficient in the expansion of the square of Eq. (12), which is well defined as the sum converges absolutely for any . From Eq. (11), these coefficients are all . However, these are also given by the expansion of the square of Eq. (12). Specifically, the -th coefficient in the square is given as the sum of all coefficients in the expansion of the root, that is, it is a convolution of the coefficients in Eq. (12) with index , thus we have
Let be a diagonally dominant matrix, let satisfying . Then . Moreover, satisfies .
Turning back to the algorithm analysis, we wish to numerically compute the eigenvalues of and bound their deviation due to roundoff errors. Other than the inaccuracy in computing , another obstacle is that is not exactly orthogonal, however it is very close to orthogonal in the sense that has a small norm. Let be the projection of onto the space of orthogonal matrices in . Clearly, is well defined if is diagonally-dominant, hence positive-definite, which can be easily verified. Also,
We now upper bound , where and therefore its spectrum is given to us explicitly as the diagonal entries of , . Compute
Estimating the spectrum of