1 Introduction
One of the biggest mysteries of deep learning is why neural networks are successfully trained in practice using gradientbased methods, despite the inherent nonconvexity of the associated optimization problem. For example, nonconvex problems can have poor local minima, which will cause any local search method (and in particular, gradientbased ones) to fail. Thus, it is natural to ask what types of assumptions, in the context of training neural networks, might mitigate such problems. For example, recent work has shown that other nonconvex learning problems, such as phase retrieval, matrix completion, dictionary learning, and tensor decomposition, do not have spurious local minima under suitable assumptions, in which case local search methods have a chance of succeeding (e.g.,
(Ge et al., 2015; Sun et al., 2015; Ge et al., 2016; Bhojanapalli et al., 2016)). Is it possible to prove similar positive results for neural networks?In this paper, we focus on perhaps the simplest nontrivial ReLU neural networks, namely predictors of the form
for some , where is the ReLU function, is a vector in , and are parameter vectors. We consider directly optimizing the expected squared loss, where the input is standard Gaussian, and in the realizable case – namely, that the target values are generated by a network of a similar architecture:
(1) 
Note that here, the choice (for all and any permutation ) is a global minimum with zero expected loss. Several recent papers analyzed such objectives, in the hope of showing that it does not suffer from spurious local minima (see related work below for more details).
Our main contribution is to prove that unfortunately, this conjecture is false, and that Eq. (1) indeed has spurious local minima once . Moreover, this is true even if the dimension is unrestricted, and even if we assume that
are orthonormal vectors. In fact, since in high dimensions randomlychosen vectors are approximately orthogonal, and the landscape of the objective function is robust to small perturbations, we can show that spurious local minima exist for
nearly all neural network problems as in Eq. (1), in high enough dimension (with respect to, say, a Gaussian distribution over
). Moreover, we show experimentally that these local minima are not pathological, and that standard gradient descent can easily get trapped in them, with a probability which seems to increase towards with the network size.Our proof technique is a bit unorthodox. Although it is possible to write down the gradient of Eq. (1) in closed form (without the expectation), it is not clear how to get analytical expressions for its roots, and hence characterize the stationary points of Eq. (1). As far as we know, an analytical expression for the roots might not even exist. Instead, we employed the following strategy: We ran standard gradient descent with random initialization on the objective function, until we reached a point which is both suboptimal (function value being significantly higher than ); approximate stationary (gradient norm very close to
); and with a strictly positive definite Hessian (with minimal eigenvalue significantly larger than
). We use a computer to verify these conditions in a formal manner, avoiding floatingpoint arithmetic and the possibility of rounding errors. Relying on these numbers, we employ a Taylor expansion argument, to show that we must have arrived at a point very close to a local (nonglobal) minimum of Eq. (
1), hence establishing the existence of such minima.On the more positive side, we show that an additional overparameterization assumption appears to be very effective in mitigating these local minima issues: Namely, we use a network larger than that needed with unbounded computational power, and replace Eq. (1) with
(2) 
where . In our experiments with up to size , we observe that whereas leads to plenty of local minima, leads to much fewer local minima, whereas no local minima were encountered once (although those might still exist for larger values of than those we tried). Thus, although Eq. (1) has local minima, we conjecture that Eq. (2) might still be proven to have no bad local minima, but this would necessarily require to be sufficiently larger than .
The paper is structured as follows: After surveying related work below, we provide our main results and proof ideas in Sec. 2. Sec. 3 provide additional experimental details about the local minima found, as well empirical evidence about the likelihood of reaching them using gradient descent. Detailed proofs are in Sec. 4.
1.1 Related Work
There is a large and rapidly increasing literature on the optimization theory of neural networks, surveying all of which is well outside our scope. Thus, in this subsection, we only briefly survey the works most relevant to ours.
We begin by noting that when minimizing the average loss over some arbitrary finite dataset, it is easy to construct problems where even for a single neuron (
in Eq. (1)), there are many spurious local minima (e.g., (Auer et al., 1996; Swirszcz et al., 2016)). Moreover, the probability of starting at a basin of such local minima is exponentially high in the dimension (Safran and Shamir, 2016). On the other hand, it is known that if the network is overparameterized, and large enough compared to the data size, then there are no local minima (Poston et al., 1991; Livni et al., 2014; Haeffele and Vidal, 2015; Zhang et al., 2016; Soudry and Carmon, 2016; Soltanolkotabi et al., 2017; Nguyen and Hein, 2017; Boob and Lan, 2017). In any case, neither these positive nor negative results apply here, as we are interested in the expected (population) loss with respect to the Gaussian distribution, which is of course nondiscrete. Also, several recent works have studied learning neural networks under a Gaussian distribution assumption (e.g., Janzamin et al. (2015); Brutzkus and Globerson (2017); Du et al. (2017); Li and Yuan (2017); Feizi et al. (2017); Zhang et al. (2017); Ge et al. (2017)), but using a network architecture different than ours, or focusing on algorithms rather than the geometry of the optimization problem. Finally, Shamir (2016) provide hardness results for training neural networks even under distributional assumptions, but these do not apply when making strong assumptions on both the input distribution and the network generating the data, as we do here.For Eq. (1), a few works have shown that there are no spurious local minima, or that gradient descent will succeed in reaching a global minimum, provided the vectors are in general position or orthogonal (Zhong et al., 2017; Soltanolkotabi et al., 2017; Tian, 2017). However, these results either apply only to , assume the algorithm is initialized close to a global optimum, or analyze the geometry of the problem only on some restricted subset of the parameter space.
The empirical observation that gradientbased methods may not work well on Eq. (1) has been made in Livni et al. (2014), and more recently in Ge et al. (2017). Moreover, Livni et al. (2014) empirically observed that overparameterization seems to help. However, our focus here is to prove the existence of such local minima, as well as more precisely quantify their behavior as a function of the network sizes.
2 Main Result and Proof Technique
Before we begin, a small note on terminology: When referring to local minima of a function on Euclidean space, we always mean spurious local minima (i.e., points such that for all in some open neighborhood of ).
Our basic result is the following:
Theorem 1.
Consider the optimization problem
where are orthogonal unit vectors in . Then for , as well as , the objective function above has spurious local minima.
Remark 1.
For smaller than , we were unable to find local minima using our proof technique, since gradient descent always seemed to converge to a global minimum. Also, although we have verified the theorem only up to , the result strongly suggests that there are local minima for larger values as well. See Sec. 3 for some examples of the local minima found.
The theorem assumes a fixed input dimension, and a particular choice of . However, these assumptions are not necessary and can be relaxed, as demonstrated by the following corollary:
Corollary 1.
Thm. 1 also applies if the space is replaced by for any (with distributed as a standard Gaussian in that space). Moreover, if are chosen i.i.d. from a Gaussian distribution (for any ), the theorem still holds with probability at least .
Remark 2.
The corollary is not specific to a Gaussian distribution over , and can be generalized to any distribution for which are approximately orthogonal and of the same norm in high dimensions (see below for details).
We now turn to explain how these results are derived, starting with Thm. 1. In what follows, we let be the vector of parameters, and let be the objective function defined in Thm. 1 (assuming are fixed). We will also assume that is thricedifferentiable in a neighborhood of (which will be shown to be true as part of our proofs), with a gradient and a Hessian .
Clearly, a global minimum of is obtained by for all (and otherwise), in which case attains a global minimum of . Thus, to prove Thm. 1, it is sufficient to find a point such that , , and . The major difficulty is showing the existence of points where the first condition is fulfilled: Gradient descent allows us to find points where , but it is very unlikely to return a point where exactly. Instead, we use a Taylorexpansion argument (detailed below), to show that if we found a point such that is sufficiently close to , as well as and , then must be close to a local minimum.
The secondorder Taylor expansion of a multivariate, thricedifferentiable function about a point , in a direction given by a unit vector and using a Lagrange remainder term, is given by
for some , and where denotes the th coordinate of . Denoting the remainder term as , we have
(3) 
Now, suppose that the point we obtain by gradient descent satisfies , and (for some positive ), uniformly for all unit vectors . Fix some and let . By the Taylor expansion above, this implies that for any and all unit ,
An elementary calculation reveals that the term is strictly positive for any
(and in particular, in the closed interval of ). Letting , and assuming , we get that there is some small closed ball of radius centered at (and with boundary ), such that
Moreover, since is continuous, it is minimized over at some point . But then
(4) 
so must reside in the interior of . Thus, it is minimal in an open neighborhood containing it, hence it is a local minimum. Overall, we have arrived at the following key lemma:
Lemma 1.
Assume that for some , it holds that and
let denote the smallest eigenvalue of , and let
If and then the function contains a local minimum, within a distance of at most from .
The only missing element is that the local minimum might be a global minimum. To rule this out, one can simply use the fact that is a Lipschitz function, so that if is much larger than , the neighboring local minimum can’t have a value of , and hence cannot be global:
Lemma 2.
The formal proof of this lemma appears in Subsection 4.1.4.
Most of the technical proof of Thm. 1 consists in rigorously verifying the conditions of Lemma 1 and Lemma 2. A major hurdle is that floatingpoint calculations are not guaranteed to be accurate (due to the possibility of roundoff and other errors), so for a formal proof, one needs to use software that comes with guaranteed numerical accuracy. In our work, we chose to use variable precision arithmetic (VPA), a standard package of MATLAB which is based on symbolic arithmetic, and allows performing elementary numerical computations with an arbitrary number of guaranteed digits of precision. The main technical issue we faced is that some calculations are not easily done with a few elementary arithmetical operations (in particular, the standard way to compute would be via a spectral decomposition of the Hessian matrix). The bulk of the proof consists of showing how we bound the quantities relevant to Lemma 1 in an elementary manner.
Finally, we turn to discuss how Corollary 1 is proven, given Thm. 1 (see Subsection 4.2 for a more formal derivation). The proof idea is that the objective does not have any “nontrivial” structure outside the span of . Therefore, if we take a local minima for
, and pad it with
zeros, we get a point in for which the gradient’s norm is unchanged, the Hessian has the same spectrum for any , and the third derivatives are still bounded. Hence, that point is a local minimum in the higherdimensional problem as well. As to the second part of the corollary, the only property of the Gaussian distribution we need is that in high dimensions, if we sample , then we are overwhelmingly likely to get approximately orthogonal vectors with approximately the same norm. Hence, up to rotation and scaling, we get a small perturbation of the objective considered in Thm. 1. Moreover, for large enough , we can make the perturbation arbitrarily small, uniformly in some compact domain. Now, recall that we prove the existence of some local minimum , by showing that in some small sphere enclosing . If the perturbations are small enough, we also have , which by arguments similar to before, imply that is close to a local minimum of .3 Experiments
So far, our technique proves the existence of local minima for the objective function in Eq. (2). However, this does not say anything about the likelihood of gradient descent to reach them. We now turn to study this question empirically.
For each value of , where and , we ran 1000 instantiations of gradient descent on the objective in Eq. (2), each starting from a different random initialization^{1}^{1}1We used standard Xavier initialization: Each weight vector was samples i.i.d. from a Gaussian distribution in , with zero mean and covariance .. Each instantiation was ran with a fixed step size of , until reaching a candidate stationary point / local minima (the stopping criterion was that the gradient norm w.r.t. any is at most ). Points obtaining objective values less than were ignored as those are likely to be close to a global minimum. Interestingly, no points with value between and were found. For all remaining candidate points, we verified that the conditions in Lemmas 1 and 2 are met^{2}^{2}2 Since running our algorithm for all suspicious points found on all architectures is time consuming, we instead identified points that are equivalent up to permutations on the order of neurons and of the data coordinates, since the objective is invariant under such permutations. By bounding the maximal Euclidean distance between these points and using the Lipschitzness of the objective and its Hessian (see Thm. 4 and Lemma 6), this allowed us to run the algorithm on a single representative from a family of equivalent points and speed up the running time drastically. Also, the objective was tested to be thricedifferentiable in all enclosing balls of radii returned by the algorithm. Specifically, we ensured that no two such balls intersect (which results in two identical neurons, where the objective is not thricedifferentiable) and that no ball contains the origin (which results in a neuron with weight , where again the objective is not thricedifferentiable). to conclude that these points are indeed close to spurious local minima (in all cases, the distance turned out to be less than ). Our verification process included verifying thricedifferentiability in the enclosing balls containing the minimum by asserting they contain no singular points, hence the objective is an analytical expression when restricted to these balls where differentiability follows.
In Tables 2 and 2, we summarize the percentage of instantiations which were verified to converge close to a spurious local minimum, as a function of . We note that among candidate points found, only a tiny fraction could not be verified to be local minima (this only occured for network sizes , and consist only of the instantiations respectively). In the tables, we also provide the minimal eigenvalue of the Hessian of the objective, and the objective value (or equivalently, the optimization error) at the points found, averaged over the instantiations^{3}^{3}3Since all points are extremely close to a local minimum, the objective at the minimum is essentially the same, up to a deviation on order less than . Also, the minimal eigenvalues vary by at most .. Note that since the minimal eigenvalue is strictly positive and varies slightly inside the enclosing ball, this indicates that these are in fact strict local minima. As the tables demonstrate, the probability of converging to a spurious local minimum increases rapidly with , and suggests that it eventually become overwhelming as long as . However, on a positive note, mild overparameterization seems to remedy this, as no local minima were found for where , and local minima for are much more scarce than for . We leave the investigation of local minima for larger values of to future work.
k  n  % of runs  Average  Average 

converging to  minimal  objective  
local minima  eigenvalue  value  
6  6  0.3%  0.0047  0.025 
7  7  5.5%  0.014  0.023 
8  8  12.6%  0.021  0.021 
9  9  21.8%  0.027  0.02 
10  10  34.6%  0.03  0.022 
11  11  45.5%  0.034  0.022 
12  12  58.5%  0.035  0.021 
13  13  73%  0.037  0.022 
14  14  73.6%  0.038  0.023 
15  15  80.3%  0.038  0.024 
16  16  85.1%  0.038  0.027 
17  17  89.7%  0.039  0.027 
18  18  90%  0.039  0.029 
19  19  93.4%  0.038  0.031 
20  20  94%  0.038  0.033 
k  n  % of runs  Average  Average 

converging to  minimal  objective  
local minima  eigenvalue  value  
8  9  0.1%  0.0059  0.021 
10  11  0.1%  0.0057  0.018 
11  12  0.1%  0.0056  0.017 
12  13  0.3%  0.0054  0.016 
13  14  1.5%  0.0015  0.038 
14  15  5.5%  0.002  0.033 
15  16  10.1%  0.004  0.032 
16  17  18%  0.0055  0.031 
17  18  20.9%  0.007  0.031 
18  19  36.9%  0.0064  0.028 
19  20  49.1%  0.0077  0.027 
In Fig. 1, we show the distribution of the objective values obtained in the points found, over the 1000 instantiations of several architectures. The figure clearly indicates that apart from a higher chance of converging to local minima, larger architectures also tend to have worse values attained on these minima.
Finally, in examples 1 and 2 below, we present some specific local minima found for and , and discuss their properties. We note that these are the smallest networks (with and respectively) for which we were able to find such points.
Example 1.
Out of 1000 gradient descent instantiations for , three converged close to a local minimum. All three were verified to be essentially identical (after permuting the neurons and up to an Euclidean distance of ), and have the following form:
where the parameter vector of each of the neurons corresponds to a column of . The Hessian of the objective at , , was confirmed to have minimal eigenvalue . This implied that all three suspicious points found for are of distance at most from a local minimum with objective value at least .
Example 2.
Out of 1000 gradient descent initializations for , one converged to a local minimum. The point found, denoted , is given below:
where the parameter vector of each of the neurons corresponds to a column of . The Hessian of the objective at , , was confirmed to have minimal eigenvalue . This implied that is of distance at most from a local minimum with objective value at least .
It is interesting to note that the points found in examples 1 and 2, as well as all other local minima detected, have a nice symmetric structure: We see that most of the trained neurons are very close to the target neurons in most of the dimensions. Also, many of the entries appear to be the same. Surprisingly, although such constructions might seem brittle, these are indeed strict local minima. Moreover, the probability of converging to such points becomes very large as the network size increases as demonstrated by our experiments.
4 Proofs
In the proofs, we use boldfaced letters (e.g., ) to denote vectors, barred boldfaced letters (e.g., ) to denote vectors normalized to unit Euclidean norm, and capital letters to generally denote matrices. Given a natural number , we let be shorthand for . Given a matrix , denotes its spectral norm. We will also make use of the following version of Weyl’s inequality, stated below for completeness.
Theorem 2 (Weyl’s inequality).
Suppose are real symmetric matrices such that . Assume that have eigenvalues respectively, and that . Then
4.1 Proof of Thm. 1
To prove Thm. 1 for some , it is enough to consider some particular choice of orthogonal , since any other choice amounts to rotating or reflecting the same objective function (which of course does not change the existence or nonexistence of its local minima). In particular, we chose these vectors to simply be the standard basis vectors in .
As we show in Subsection 4.1.1 below, the objective function in Eq. (2) can be written in an explicit form (without the expectation term), as well as its gradients and Hessians. We first ran standard gradient descent, starting from random initialization and using a fixed step size of , till we reached a point , such that the gradient norm w.r.t. any is at most . Given this point, we use Lemma 1 and Lemma 2 to prove that it is close to a local minimum. Specifically, we built code which does the following:

Provide a rigorous upper bound on the norm of the gradient at a given point (since we have a closedform expression for the gradient, this only requires elementary calculations).

Provide a rigorous lower bound on the minimal eigenvalue of : This is the technically most demanding part, and the derivation of the algorithm is presented in Subsection 4.1.2.

Provide a rigorous upper bound on the remainder term (see Subsection 4.1.3 for the relevant calculations).
We used MATLAB (version 2017b) to perform all floatingpoint computations, and its associated MATLAB VPA package to perform the exact symbolic computations. The code we used is freely available at https://github.com/ItaySafran/OneLayerGDconvergence.git. For any candidate local minimum, the verification took from less than a minute up to a few hours, depending on the size of , when running on Intel Xeon E5 processors (ranging from E52430 to E52660).
4.1.1 Closedform Expressions for and
For convenience, we will now state closedform expressions (without an expectation) for the objective function in Eq. (2), its gradient and its Hessian. These are also the expressions used in the code we built to verify the conditions of Lemma 1 and Lemma 2. First, we have that
(6) 
where
(7) 
and is the angle between two vectors . The latter equality in Eq. (7) was shown in Cho and Saul (2009, section 2).
Using the above representation, Brutzkus and Globerson (2017) compute the gradient of with respect to , given by
(8) 
Which implies that , the gradient of the objective with respect to , equals
where equals on entries through , and zero elsewhere. We now provide the Hessian of Eq. (7) based on the computation of the gradient in Eq. (8) (see Subsection 4.3.1 for the full derivation)
where
(9) 
To formally define the Hessian of (a matrix), we partition it into blocks, each of size . Define to equal on the th diagonal block and zero elsewhere. For define to equal on the th block and zero elsewhere. We now have that the Hessian is given by
(10) 
4.1.2 Lower bound on
We wish to verify that the Hessian of a point returned by the gradient descent algorithm is positive definite, as well as provide a lower bound for its smallest eigenvalue, avoiding the possibility of errors due to floatingpoint computations.
Since the Hessians we encounter have relatively small entries and are wellconditioned, it turns out that computing the spectral decomposition in floatingpoint arithmetic provides a very good approximation of the true spectrum of the matrix. Therefore, instead of performing spectral decomposition symbolically from scratch, our algorithmic approach is to use the floatingpoint decomposition, and merely bound its error, using simple quantities which are easy to compute symbolically. Specifically, given the (floatingpoint, possibly approximate) decomposition of a matrix , we bound the error using the distance of from , as well as the distance of from its projection on the subspace of orthogonal matrices given by . Formally, we use the following algorithm (where numerical computations refer to operations in floatingpoint arithmetic):
Algorithm analysis: For the purpose of analyzing the algorithm, the following two lemmas will be used.
Lemma 3.
For any natural we have
Proof.
Clearly, for any we have
(11) 
Using the generalized binomial theorem, we have for any
(12) 
Consider the th coefficient in the expansion of the square of Eq. (12), which is well defined as the sum converges absolutely for any . From Eq. (11), these coefficients are all . However, these are also given by the expansion of the square of Eq. (12). Specifically, the th coefficient in the square is given as the sum of all coefficients in the expansion of the root, that is, it is a convolution of the coefficients in Eq. (12) with index , thus we have
∎
Lemma 4.
Let be a diagonally dominant matrix, let satisfying . Then . Moreover, satisfies .
Proof.
Turning back to the algorithm analysis, we wish to numerically compute the eigenvalues of and bound their deviation due to roundoff errors. Other than the inaccuracy in computing , another obstacle is that is not exactly orthogonal, however it is very close to orthogonal in the sense that has a small norm. Let be the projection of onto the space of orthogonal matrices in . Clearly, is well defined if is diagonallydominant, hence positivedefinite, which can be easily verified. Also,
We now upper bound , where and therefore its spectrum is given to us explicitly as the diagonal entries of , . Compute
Estimating the spectrum of
Comments
There are no comments yet.