1 Introduction
The empirical success of deep neural networks has sparked great interest in the theory of deep models. From an optimization viewpoint, the biggest mystery is that deep neural networks are successfully trained by gradient-based algorithms despite their nonconvexity. On the other hand, it has been known that training neural networks to global optimality is NP-hard
[2]. It is also known that even checking local optimality of nonconvex problems can be NP-hard [16]. Bridging this gap between theory and practice is a very active area of research. There have been many attempts to understand why optimization works well for neural networks, by studying the loss surface [1, 27, 10, 24, 17, 18, 21, 12, 28, 29, 30, 26, 23] and the role of (stochastic) gradient-based methods [25, 4, 8]. Besides nonconvexity, ReLU networks pose significant additional challenges for analysis due to nondifferentiability, and a precise understanding of the nondifferentiable points is still elusive.
Nondifferentiable points lie in a set of measure zero, so one may be tempted to overlook them as “nongeneric.” However, when studying critical points we cannot do so, as they are precisely such “nongeneric” points. Laurent and von Brecht [12]
study one-hidden-layer ReLU networks with hinge loss and note that, except for piecewise constant regions, local minima always occur on nonsmooth boundaries. Likely due to the difficulty of the analysis, there have been no other works that handle such nonsmooth points of losses and prove results that hold for
all points. Some theorems [24, 18] hold “almost surely”; some assume differentiability or make statements only for differentiable points [17, 28]; others analyze the population risk, in which case the nondifferentiability disappears after taking the expectation [25, 4, 8, 21, 26].

1.1 Summary of our results
In this paper, we take a step towards understanding nondifferentiable points of the empirical risk of one-hidden-layer ReLU(-like) networks. Specifically, we provide a theoretical algorithm that tests second-order stationarity (SOS) for any point of the loss surface. It takes an input point and returns one of the following:

The point is a local minimum; or

The point is a second-order stationary point (SOSP); or

A descent direction in which the function value strictly decreases.
Therefore, we can test whether a given point is a SOSP. If it is not, the test extracts a guaranteed descent direction that lets minimization continue, or even escape from saddle points. What makes this nontrivial is that, unlike Hessian-based methods for escaping saddles, we cannot rely on differentiability.
The key computational challenge in constructing our algorithm for nondifferentiable points is posed by data points that lie on the “boundary” of a hidden neuron. Since each such data point bisects the parameter space into two halfspaces with different “slopes” of the loss surface, one runs into nondifferentiability. If there are
such boundary data points, then in the worst case the parameter space divides into regions, so naively testing each region would be very inefficient. In our algorithm, we overcome this issue through a careful use of polyhedral geometry. Another challenge comes from the second-order test, which involves solving nonconvex QPs. Although QP is NP-hard in general [19], we prove that the QPs arising in our algorithm can still be solved efficiently in most cases. We further describe the challenges and key ideas in Section 2.1.

Remarks.
Many practitioners of deep learning rely on first-order methods without good termination criteria for the optimization problem. Our algorithm offers a tool for improvement: with a proper numerical implementation (which we leave to future work), it can test whether a given point is a SOSP, or extract a descent direction using second-order information. One can imagine running a first-order method until it “gets stuck,” then using our algorithm to test SOS or escape from the saddle. This idea of mixing first- and second-order methods has been explored for differentiable problems
[6, 20, 15].

Notation.
For a vector
, denotes its th component, and denotes a seminorm where is a positive semidefinite matrix. Given a matrix , we let , , and be ’s th entry, the th row, and the th column, respectively.

2 Problem setting and key ideas
We consider a one-hidden-layer neural network with input dimension , hidden layer width , and output dimension . We are given pairs of data points and labels , where and . Given an input vector , the output of the network is defined as , where , , , and
are the network parameters. The activation function
is “ReLU-like,” meaning , where and . Note that ReLU and Leaky ReLU are members of this class. In training neural networks, we are interested in minimizing the empirical risk over the parameters , where
is the loss function. We make the following assumptions on the loss function and the training dataset: The loss function
is twice continuously differentiable and convex in . No data points lie on the same affine hyperplane. Assumption
1 is satisfied by many standard loss functions, such as the squared error loss and cross-entropy loss. Assumption 2 means that, for example, no three data points lie on the same line. Since real-world datasets contain noise, this assumption is also quite mild.

2.1 Challenges and key ideas
In this section, we explain the difficulties that arise at nondifferentiable points and our ideas for overcoming them. Our algorithm is built from first principles, rather than on advanced tools from nonsmooth analysis.
Bisection by boundary data points.
Since the activation function is nondifferentiable at , the behavior of data points at the “boundary” is decisive. Consider a simple example , so is a row vector. If , then the sign of for any small perturbations and stays invariant. In contrast, when there is a point on the “boundary,” i.e., , then the slope depends on the direction of perturbation, leading to nondifferentiability. We refer to such points as boundary data points. When ,
and similarly, the slope is for . This means that the “gradient” (as well as higher-order derivatives) of depends on the direction of .
Thus, every boundary data point defines a hyperplane through the origin that bisects the parameter space into two halfspaces. The situation is even worse if we have boundary data points: they lead to a worst case of regions. Does this mean that we need to test all regions separately? We show that there is a way to remedy this problem; but before that, we first describe how to test local minimality or stationarity within each region.
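To make the picture concrete in our own notation (the symbols $w_k$, $b_k$, $x_i$ below are ours; the paper's elided symbols may differ), a data point $x_i$ is on the boundary of hidden neuron $k$ exactly when

$$w_k^\top x_i + b_k = 0,$$

so in the space of $(w_k, b_k)$ it defines the hyperplane $\{(w_k, b_k) : w_k^\top x_i + b_k = 0\}$ through the origin. On one side the neuron is active on $x_i$ and on the other side inactive, so the loss has a different slope in each of the two halfspaces.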
Secondorder local optimality conditions.
We can expand and obtain the following Taylor-like expansion for small enough perturbations (see Lemma 3 for details):
(1) 
where is a vectorized version of all parameters and is the corresponding vector of perturbations. Notice that in (1), the usual Taylor expansion does not exist at nondifferentiable points; instead, the corresponding “gradient” and “Hessian” depend on the direction of the perturbation . Also, the space of is divided into regions, and and are piecewise-constant functions of , constant inside each region. One could view this problem as constrained optimization problems and try to solve the KKT conditions at ; instead, we provide an approach developed from first principles that solves all problems efficiently.
Given the expansion (1) and the observation that the derivatives are invariant to scaling of , one can note that (a) for all , and (b) for all such that , are necessary conditions for local optimality of ; a point satisfying them is a “SOSP” (see Definition 2.2). The conditions become sufficient if (b) is replaced with for all such that . In fact, this is a generalized version of the second-order necessary (or sufficient) conditions, i.e., and (or ), for differentiable .
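As a sketch, the expansion (1) and the stationarity conditions can be written in generic notation (the symbols $\ell$, $\eta$, $g$, $H$ are our own stand-ins for the paper's elided ones):

$$\ell(\theta + \eta) \;=\; \ell(\theta) + g(\eta)^\top \eta + \tfrac{1}{2}\, \eta^\top H(\eta)\, \eta + o(\|\eta\|^2),$$

with $g(\cdot)$ and $H(\cdot)$ constant on each polyhedral cone of directions $\eta$. The necessary conditions then read (a) $g(\eta)^\top \eta \ge 0$ for all $\eta$, and (b) $\eta^\top H(\eta)\, \eta \ge 0$ for all $\eta$ with $g(\eta)^\top \eta = 0$; replacing the inequality in (b) with $\eta^\top H(\eta)\, \eta > 0$ gives sufficiency. When $\ell$ is differentiable, these reduce to $\nabla \ell(\theta) = 0$ and $\nabla^2 \ell(\theta) \succeq 0$ (or $\succ 0$).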
Efficiently testing SOSP for exponentially many regions.
Motivated by the second-order expansion (1) and the necessary/sufficient conditions, our algorithm consists of three steps:

Testing first-order stationarity (in the Clarke sense, see Definition 2.2),

Testing for all ,

Testing for .
The tests are executed from Step a to Step c. Whenever a test fails, we get a strict descent direction , and the algorithm returns and terminates. Below, we briefly outline each step and discuss how the tests can be performed efficiently. We first check first-order stationarity because it makes Step b easier. Step a is done by solving one convex QP per hidden node. For Step b, we formulate linear programs (LPs), one per region, so that checking whether all LPs have minimum cost zero is equivalent to checking for all . Here, the feasible sets of the LPs are pointed polyhedral cones, so it suffices to check only the extreme rays of the cones. It turns out that there are only extreme rays, each shared by cones, so the test can be done with only inequality/equality tests instead of solving exponentially many LPs. In Step b, we also record the flat extreme rays, i.e., those with , for later use in Step c.

In Step c, we test whether the second-order perturbation can be negative for directions where . Due to the constraint , the second-order test requires solving constrained nonconvex QPs. If there is no flat extreme ray, we need to solve only one equality-constrained QP (ECQP). If there exist flat extreme rays, a few more inequality-constrained QPs (ICQPs) are solved. Despite the NP-hardness of general QPs [19], we prove that the specific QPs arising in our algorithm can still be solved efficiently in most cases. More specifically, we prove that projected gradient descent on ECQPs converges/diverges exponentially fast, and each step takes time ( is the number of parameters). In the case of ICQPs, it takes time to solve the QP, where is the number of boundary data points that have flat extreme rays (). Hence, if is small enough, the ICQP can still be solved in time polynomial in . At the end of the paper, we provide empirical evidence that the number of flat extreme rays is zero or very small, meaning that in most cases the QPs can be solved efficiently.
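The three-step control flow can be sketched as follows. All helper names (`first_order_test`, `extreme_ray_test`, `second_order_test`) are hypothetical stand-ins for the paper's subroutines, shown only to fix the structure of the test, not its actual implementation.

```python
# Sketch of the three-step structure of SOS testing. The helper functions
# are hypothetical stand-ins, used only to illustrate the control flow.

def sosp_check(theta, first_order_test, extreme_ray_test, second_order_test):
    """Return ('descent', direction), 'SOSP', or 'local_min'."""
    # Step (a): first-order stationarity in the Clarke sense.
    ok, direction = first_order_test(theta)
    if not ok:
        return ('descent', direction)

    # Step (b): first-order increase along every extreme ray; also
    # collect the "flat" rays for the second-order test.
    ok, direction, flat_rays = extreme_ray_test(theta)
    if not ok:
        return ('descent', direction)

    # Step (c): second-order test -- one ECQP, plus ICQPs if flat rays exist.
    verdict, direction = second_order_test(theta, flat_rays)
    if verdict == 'negative':      # negative curvature found
        return ('descent', direction)
    if verdict == 'zero':          # necessary but not sufficient conditions
        return 'SOSP'
    return 'local_min'             # strictly positive curvature everywhere
```

Each sub-test either passes or hands back a strict descent direction, which is what allows the overall procedure to double as a saddle-escaping step.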
2.2 Problem-specific notation and definitions
In this section, we define a more precise notion of generalized stationary points and introduce additional symbols that streamline the description of our algorithm in Section 3. Since we are dealing with nondifferentiable points of the nonconvex , the usual notions of (sub)gradients no longer apply. Here, the Clarke subdifferential is a useful generalization [7]: [FOSP, Theorem 6.2.5 of Borwein and Lewis [3]] Suppose that a function is locally Lipschitz around the point , and differentiable in , where has Lebesgue measure zero. Then the Clarke differential of at is
If , we say is a first-order stationary point (FOSP). From the definition, note that the Clarke subdifferential is the convex hull of all possible values of in (1). For parameters , let and be the Clarke differentials w.r.t. and , respectively. They are the projections of onto the spaces of the individual parameters. Whenever the point is clear (e.g., in our algorithm), we will omit from . Next, we define second-order stationary points for the empirical risk . Notice that this generalizes the definition of SOSP for differentiable functions : and . [SOSP] We say is a second-order stationary point (SOSP) of if (1) is a FOSP, (2) for all , and (3) for all such that .
Given an input data point , we define to be the output of the hidden layer. We note that this notation is overloaded with big-O notation, but the meaning will be clear from context. Consider perturbing the parameters with ; then the perturbed output of the network and the amount of perturbation can be expressed as
where can be thought of, informally, as the “Jacobian” matrix of the hidden layer. The matrix is diagonal, and its th diagonal entry is given by
where is the derivative of . We define , which is harmless because it is always multiplied by zero in our algorithm. For boundary data points, depends on the direction of the perturbations , as noted in Section 2.1. We additionally define and to separate the terms in that are linear in the perturbations from those that are quadratic in the perturbations.
For simplicity of notation for the rest of the paper, we define for all ,
In our algorithm and its analysis, we need to give special treatment to the boundary data points. To this end, for each node in the hidden layer, define the boundary index set as
The subspace spanned by vectors for in plays an important role in our tests; so let us define a symbol for it, as well as the cardinality of and their sum:
For , let be the th row of , and be the th column of . Next, we define the total number of parameters , and vectorized perturbations :
Also let be vectorized parameters , packed in the same order as .
Define the matrix . This quantity appears multiple times and does not depend on the perturbation, so it is helpful to have a symbol for it.
3 Test algorithm for secondorder stationarity
In this section, we present SOSPCheck (Algorithm 1), which takes an arbitrary tuple of parameters as input and checks whether it is a SOSP. We first present a lemma that shows the explicit form of the perturbed empirical risk and identifies the first- and second-order perturbations. The proof is deferred to Appendix B.2. For a small enough perturbation ,
where and satisfy
and . Also, and are piecewise constant functions of , which are constant inside each polyhedral cone in space of .
Rough pseudocode of SOSPCheck is presented in Algorithm 1. As described in Section 2.1, the algorithm consists of three steps: (a) testing first-order stationarity, (b) testing for all , and (c) testing for . If the input point satisfies the second-order sufficient conditions for local minimality, the algorithm declares it a local minimum. If the point satisfies only the second-order necessary conditions, it returns “SOSP.” If a strict descent direction is found, the algorithm terminates immediately and returns . A brief description follows; the full algorithm (Algorithm 2) and a complete proof of correctness are deferred to Appendix A.
3.1 Testing first-order stationarity (lines 2, 5, and 9)
Line 2 of Algorithm 1 corresponds to testing if and are singletons with zero. If not, the opposite direction is a descent direction. More details are in Appendix A.1.1.
The test for and is more difficult because depends on and when there are boundary data points. For each , Line 5 (if ) and Line 9 (if ) test whether is in . Note from Definition 2.2 and Lemma 3 that is the convex hull of all possible values of . If , this can be tested by solving a convex QP:
(2) 
If the solution does not achieve zero objective value, then we can directly return a descent direction. For details please refer to FOSubdiffZeroTest (Algorithm 3) and Appendix A.1.2.
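Testing first-order stationarity here amounts to checking whether zero lies in a convex hull of candidate (sub)gradient values. As a rough illustration (not the paper's exact QP (2) or its solver; the matrix `V` and tolerances below are our own), one can minimize $\|V\lambda\|^2$ over the probability simplex with Frank-Wolfe; a zero optimum certifies that the origin lies in the hull of the columns of $V$.

```python
import numpy as np

def min_norm_in_hull(V, iters=20000):
    """Frank-Wolfe for  min ||V @ lam||^2  over the probability simplex.
    Columns of V are the candidate (sub)gradient values (illustrative)."""
    m = V.shape[1]
    lam = np.ones(m) / m
    for t in range(iters):
        grad = 2.0 * V.T @ (V @ lam)   # gradient of the objective w.r.t. lam
        j = int(np.argmin(grad))       # best simplex vertex (the LP oracle)
        step = 2.0 / (t + 2.0)         # standard Frank-Wolfe step size
        lam *= (1.0 - step)
        lam[j] += step
    p = V @ lam
    return float(p @ p)

# Origin strictly inside the hull of the columns -> optimum 0.
val_in = min_norm_in_hull(np.array([[1.0, -1.0, -1.0],
                                    [0.0,  1.0, -1.0]]))
# Origin outside the hull -> optimum is the squared distance to the hull.
val_out = min_norm_in_hull(np.array([[1.0, 2.0, 1.0],
                                     [0.0, 0.0, 1.0]]))
```

Here `val_in` comes out near zero while `val_out` stays near 1; in the algorithm, a strictly positive optimum is what yields a descent direction.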
3.2 Testing for all (lines 6–7)
Linear program formulation.
Lines 6–7 are about testing if for all directions of . If , with the solution from QP (2) we can write as
Every bisects into two halfspaces, and , in each of which stays constant. Note that by Lemma 2.2, ’s for are linearly independent. So, given boundary data points, they divide the space of into polyhedral cones.
Since is constant in each polyhedral cone, we can let for all , and define an LP for each :
(3) 
Solving these LPs and checking whether the minimum value is suffices to prove for all small enough perturbations. The constraint is present because any is also orthogonal to ; it is equivalent to linearly independent equality constraints. So, the feasible set of LP (3) has linearly independent constraints, which implies that it is a pointed polyhedral cone with vertex at the origin. Since any point in a pointed polyhedral cone is a conical combination (a linear combination with nonnegative coefficients) of the cone’s extreme rays, checking nonnegativity of the objective on all extreme rays suffices. We emphasize that we do not solve the LPs (3) in our algorithm; we only check the extreme rays.
Computational efficiency.
Extreme rays of a pointed polyhedral cone in are computed from linearly independent active constraints. For each , each extreme ray must be tested, in both directions, for whether . Note that there are extreme rays, and each extreme ray is shared by polyhedral cones. Moreover, for , which indicates that
regardless of . Testing an extreme ray can therefore be done with a single inequality test instead of separate tests for all cones. Thus, checking extreme rays instead of solving individual LPs greatly reduces computation, from to .
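The reason checking extreme rays suffices is linearity: any feasible direction is a conic combination of the rays, so a linear objective that is nonnegative on every ray is nonnegative everywhere in the cone. A toy numerical confirmation (the rays and the "gradient" vector are illustrative values of ours, not the paper's):

```python
import numpy as np

# Extreme rays of a toy pointed cone in R^3 and a linear "gradient" g.
rays = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])
g = np.array([2.0, 0.5, 0.0])

# The cheap per-ray tests: g^T r >= 0 for every extreme ray r.
assert np.all(rays @ g >= 0)

# Any conic combination z = sum_j a_j r_j with a_j >= 0 then satisfies
# g^T z = sum_j a_j (g^T r_j) >= 0 by linearity -- no LP needed per cone.
rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.uniform(0.0, 5.0, size=3)   # nonnegative coefficients
    z = a @ rays
    assert g @ z >= -1e-12
```

This is exactly what lets the algorithm replace exponentially many LPs with a linear number of inequality tests.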
Testing extreme rays.
For the details of testing all possible extreme rays, please refer to FOIncreasingTest (Algorithm 4) and Appendix A.2. FOIncreasingTest computes all possible extreme rays and tests whether they satisfy . If the inequality fails for an extreme ray , then is a descent direction, so we return . If the inequality holds with equality, then is a flat extreme ray, which needs to be checked in the second-order test, so we save it for later use.
How many flat extreme rays () are there? The presence of flat extreme rays introduces inequality constraints into the QP that we solve in the second-order test. It is ideal to have none, because then there are only equality constraints and the QP is easier to solve. Lemma A.2 in Appendix A.2 gives the conditions for flat extreme rays to exist; in short, there is a flat extreme ray if or or . For more details, please refer to Appendix A.2.
3.3 Testing for (lines 12–17)
The second-order test checks for “flat” ’s satisfying . This is done with the help of the function SOTest (Algorithm 5). Given its input , SOTest defines fixed “Jacobian” matrices for all data points and equality/inequality constraints for boundary data points, and solves a QP of the following form:
(4) 
Constraints and number of QPs.
There are equality constraints of the form . These equality constraints arise from the nonnegative homogeneity of the activation ; i.e., scaling and by and scaling by yields exactly the same network. Thus, these equality constraints force to be orthogonal to the loss-invariant directions. This observation is stated more formally in Lemma A.3, which shows as a corollary that any differentiable FOSP of always has a rank-deficient Hessian. The other constraints ensure that the union of the feasible sets of the QPs is exactly (see Lemma A.3 in Appendix A.3 for details). It is also easy to check that these constraints are all linearly independent.
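In generic notation (with $w_k$, $b_k$ the $k$-th hidden neuron's incoming weights and bias, and $v_k$ its outgoing weights; symbols ours, not necessarily the paper's), nonnegative homogeneity $\sigma(ct) = c\,\sigma(t)$ for $c > 0$ means the rescaling $(w_k, b_k, v_k) \mapsto (c\,w_k,\; c\,b_k,\; v_k/c)$ leaves the network output unchanged:

$$\frac{v_k}{c}\,\sigma\!\big(c\,w_k^\top x + c\,b_k\big) \;=\; \frac{v_k}{c}\cdot c\,\sigma\!\big(w_k^\top x + b_k\big) \;=\; v_k\,\sigma\!\big(w_k^\top x + b_k\big).$$

Differentiating this loss-invariant curve at $c = 1$ yields a direction of zero curvature, which is the intuition behind the rank-deficient Hessian at differentiable FOSPs.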
If there is no flat extreme ray, the algorithm solves just one QP with equality constraints. If there are flat extreme rays, the algorithm solves one QP with equality constraints, and more QPs with equality constraints and inequality constraints, where
(5) 
Recall from Section 3.2 that has a flat extreme ray if or or ; thus, . Please refer to Appendix A.3 for more details.
Efficiency of solving the QPs (4).
Despite the NP-hardness of general QPs, the specific QPs (4) can be solved quite efficiently, avoiding exponential complexity in . After solving QP (4), there are three (disjoint) termination conditions:

whenever , or

whenever , but such that , or

such that and ,
where is the feasible set of the QP. With the following two lemmas, we show that the termination conditions can be efficiently tested for ECQPs and ICQPs. First, ECQPs can be solved iteratively with projected gradient descent, as stated in the next lemma. Consider the QP, where is symmetric and has full row rank:
Then, projected gradient descent (PGD) updates
with learning rate converges to a solution or diverges to infinity exponentially fast. Moreover, with random initialization, PGD correctly checks conditions a–c with probability . The proof is an extension of the unconstrained case [13], and is deferred to Appendix B.3. Note that it takes time to compute at the beginning, and each update takes time. It is also surprising that the convergence rate does not depend on .
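A minimal sketch of PGD on an equality-constrained QP of the generic form $\min_z z^\top M z$ s.t. $Az = 0$ follows (the paper's exact QP, learning rate, and stopping rule are not reproduced here; the divergence threshold and the toy matrices are our own choices). Feasibility is maintained by projecting onto the null space of $A$ with $P = I - A^\top (AA^\top)^{-1} A$.

```python
import numpy as np

def pgd_ecqp(M, A, lr=0.1, steps=500, seed=0):
    """Projected gradient descent on  min_z z^T M z  s.t.  A z = 0.
    Generic sketch: returns ('diverged', unit direction) if the iterates
    blow up exponentially (exposing feasible negative curvature), and
    ('converged', z) otherwise. A must have full row rank."""
    n = M.shape[0]
    # Orthogonal projector onto the null space of A.
    P = np.eye(n) - A.T @ np.linalg.solve(A @ A.T, A)
    z = P @ np.random.default_rng(seed).standard_normal(n)  # feasible start
    for _ in range(steps):
        z = P @ (z - lr * 2.0 * (M @ z))   # gradient of z^T M z is 2 M z
        nrm = np.linalg.norm(z)
        if nrm > 1e8:                      # exponential blow-up detected
            return 'diverged', z / nrm
    return 'converged', z

M = np.diag([1.0, 1.0, -1.0])
# M is indefinite, but the constraint z_3 = 0 removes its negative
# direction, so PGD contracts toward zero:
status, _ = pgd_ecqp(M, np.array([[0.0, 0.0, 1.0]]))
# Constraining z_1 = 0 instead leaves the negative direction feasible,
# and PGD diverges along it:
status2, d = pgd_ecqp(M, np.array([[1.0, 0.0, 0.0]]))
```

Here `status` is `'converged'` and `status2` is `'diverged'`, with the returned direction `d` satisfying `d @ M @ d < 0`, i.e., a certified feasible direction of negative curvature.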
In the presence of flat extreme rays, we must solve QPs with inequality constraints. We prove that our ICQP can be solved in time, which implies that as long as the number of flat extreme rays is small, the problem can still be solved in time polynomial in . Consider the QP, where is symmetric, and have full row rank, and has rank :
Then, there exists a method that checks conditions a–c in time. In short, we transform to define an equivalent problem and use classical results on copositive matrices [14, 22, 9]; the problem can be solved by computing the eigensystem of a matrix and testing copositivity of an matrix. The proof is presented in Appendix B.4.
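For intuition on the final step: a matrix $C$ is copositive when $x^\top C x \ge 0$ for all entrywise-nonnegative $x$. As a rough illustration only (the paper's exact eigensystem-based procedure from [14, 22, 9] is not reproduced here; the grid-sampling heuristic below is our own), for a small matrix one can evaluate $x^\top C x$ over a dense grid of the probability simplex:

```python
import itertools
import numpy as np

def is_copositive_sampled(C, grid=20):
    """Heuristic copositivity check for a small symmetric matrix C:
    evaluate x^T C x on a dense grid of the probability simplex.
    Illustrative only -- a sampled check can miss violations between
    grid points, unlike exact tests for small matrices."""
    k = C.shape[0]
    for idx in itertools.product(range(grid + 1), repeat=k):
        s = sum(idx)
        if s == 0:
            continue                          # skip the origin
        x = np.array(idx, dtype=float) / s    # a point on the simplex
        if x @ C @ x < -1e-12:
            return False
    return True

# The identity is copositive; a negative diagonal entry never is.
assert is_copositive_sampled(np.eye(2))
assert not is_copositive_sampled(np.array([[1.0, 0.0], [0.0, -1.0]]))
```

The cost is exponential in the matrix size, which is tolerable precisely because the matrix being tested here has dimension governed by the (typically tiny) number of flat-extreme-ray constraints.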
Concluding the test.
During the calls to SOTest, if any QP terminates with condition c, SOSPCheck immediately returns the direction and terminates. After solving all QPs, if any SOTest call finished with condition b, we conclude SOSPCheck with “SOSP.” If all QPs terminated with condition a, we return “Local Minimum.”
4 Experiments
For the experiments, we used artificial datasets sampled i.i.d. from the standard normal distribution, and trained one-hidden-layer ReLU networks with squared error loss. In practice, it is impossible to land exactly on a nondifferentiable point, because such points lie in a set of measure zero. To get close to them, we ran Adam
[11] using the full-batch (exact) gradient for 200,000 iterations with a decaying step size (starting at , decayed every 20,000 iterations). We observed that decaying the step size had the effect of “descending deeper into the valley.”

After running Adam, for each , we counted the number of approximate boundary data points satisfying , which gives an estimate of . Moreover, for these points, we solved the QP (2) using L-BFGS-B [5] to check whether the terminated points are indeed (approximate) FOSPs. We found that the optimal values of (2) are close to zero ( typically, for the largest problems). After solving (2), we counted the number of ’s that ended up with or ; the number of such ’s is an estimate of . We also counted the number of approximate boundary data points satisfying , giving an estimate of .

We ran the above experiments for different settings of , 40 times each. We fixed for simplicity. For large , the optimizer converged to near-zero minima, making uniformly small, so it was difficult to obtain accurate estimates of and . Thus, we performed experiments in settings where the optimizer converged to minima that are far from zero.
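The counting step described above can be sketched as follows; the variable names and the tolerance are illustrative choices of ours, not the paper's exact ones.

```python
import numpy as np

def count_boundary_points(W, b, X, tol=1e-4):
    """For each hidden neuron k, count the data points x_i whose
    pre-activation w_k^T x_i + b_k is within tol of zero -- the
    approximate boundary data points.
    W: (d_h, d_x) weights, b: (d_h,) biases, X: (n, d_x) data."""
    pre = X @ W.T + b                      # (n, d_h) pre-activations
    return (np.abs(pre) <= tol).sum(axis=0)

# One hidden neuron; pre-activations on the three points are 1, -1, 0,
# so exactly one point is (approximately) on the boundary.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[1.0, -1.0]])
b = np.array([0.0])
counts = count_boundary_points(W, b, X)    # array([1])
```

Summing these per-neuron counts over the hidden layer gives the estimate reported in the table below.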
# Runs | Sum (Avg.) | Sum (Avg.) | Sum (Avg.) |
40 | 290 (7.25) | 0 (0) | 0 (0) | 0
40 | 371 (9.275) | 1 (0.025) | 0 (0) | 0.025
40 | 1,452 (36.3) | 0 (0) | 0 (0) | 0
40 | 2,976 (74.4) | 2 (0.05) | 0 (0) | 0.05
40 | 24,805 (620.125) | 4 (0.1) | 0 (0) | 0.1
40 | 14,194 (354.85) | 0 (0) | 0 (0) | 0
40 | 42,334 (1,058.35) | 37 (0.925) | 1 (0.025) | 0.625
Table 1 summarizes the results. Across 280 runs, we observed that there are surprisingly many boundary data points () in general, but usually zero or very few (at most 3) flat extreme rays (). This observation carries two important messages: (1) many local minima lie at nondifferentiable points, which is why our analysis is meaningful; (2) fortunately, is usually very small, so we only need to solve ECQPs () or ICQPs with a very small number of inequality constraints, which can be solved efficiently (Lemmas 3.3 and 3.3). We can also observe that , , and indeed increase as the model dimensions and training set get larger, but the rate of increase is not as fast as , , and .
For our experiments, we used artificial regression datasets instead of real ones. This is because popular real datasets are classification datasets, but the cross-entropy loss becomes arbitrarily small as the parameters tend to infinity, and the hinge loss is not differentiable; these losses are not adequate for our purposes.
5 Discussion and Future Work
We provided an algorithm that tests second-order stationarity and escapes saddle points, for nondifferentiable points of the empirical risk of shallow ReLU-like networks. Despite the difficulty raised by boundary data points dividing the parameter space into regions, we reduced the computation to convex QPs, equality/inequality tests, and one (or a few more) nonconvex QPs. In benign cases, the last QP is equality constrained and can be solved efficiently with projected gradient descent. In the worst case, the QP has a few (say ) inequality constraints, but it can still be solved efficiently when is small. We also provided empirical evidence that is usually zero or very small, suggesting that the test can be run efficiently in most cases. Extending this test to deeper neural networks is a natural direction for future work. Also, a numerical implementation of this algorithm, combined with practical gradient-based methods, will be of great interest.
References

[1] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
[2] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. In Proceedings of the 1st International Conference on Neural Information Processing Systems, pages 494–501. MIT Press, 1988.
[3] J. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media, 2010.
[4] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.
[5] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
[6] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. arXiv preprint arXiv:1611.00756, 2016.
[7] F. H. Clarke, Y. S. Ledyaev, R. J. Stern, and P. R. Wolenski. Nonsmooth Analysis and Control Theory, volume 178. Springer Science & Business Media, 2008.
[8] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
[9] J.-B. Hiriart-Urruty and A. Seeger. A variational approach to copositive matrices. SIAM Review, 52(4):593–629, 2010.
[10] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] T. Laurent and J. von Brecht. The multilinear structure of ReLU networks. arXiv preprint arXiv:1712.10132, 2017.
[13] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
[14] D. H. Martin and D. H. Jacobson. Copositive matrices and definiteness of quadratic forms subject to homogeneous linear inequality constraints. Linear Algebra and its Applications, 35:227–258, 1981.
[15] A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. arXiv preprint arXiv:1809.02162, 2018.
[16] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
[17] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2603–2612, 2017.
[18] Q. Nguyen and M. Hein. Optimization landscape and expressivity of deep CNNs. arXiv preprint arXiv:1710.10928, 2017.
[19] P. M. Pardalos and S. A. Vavasis. Quadratic programming with one negative eigenvalue is NP-hard. Journal of Global Optimization, 1(1):15–22, 1991.
[20] S. J. Reddi, M. Zaheer, S. Sra, B. Poczos, F. Bach, R. Salakhutdinov, and A. J. Smola. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017.
[21] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.
[22] A. Seeger. Eigenvalue analysis of equilibrium processes defined by linear complementarity conditions. Linear Algebra and its Applications, 292(1-3):1–14, 1999.
[23] O. Shamir. Are ResNets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
[24] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
[25] Y. Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413, 2017.
[26] C. Wu, J. Luo, and J. D. Lee. No spurious local minima in a two hidden unit ReLU network. In International Conference on Learning Representations Workshop, 2018.
[27] X.-H. Yu and G.-A. Chen. On the local minima free condition of backpropagation learning. IEEE Transactions on Neural Networks, 6(5):1300–1303, 1995.
[28] C. Yun, S. Sra, and A. Jadbabaie. Spurious local minima in neural networks: A critical view. arXiv preprint arXiv:1802.03487, 2018.
[29] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. In International Conference on Learning Representations, 2018.
[30] Y. Zhou and Y. Liang. Critical points of neural networks: Analytical forms and landscape properties. In International Conference on Learning Representations, 2018.
Appendix A Full algorithms and proof of correctness
In this section, we present the detailed operation of SOSPCheck (Algorithm 2), and its helper functions FOSubdiffZeroTest, FOIncreasingTest, and SOTest (Algorithm 3–5).
In the subsequent subsections, we provide a more detailed proof of the correctness of Algorithm 2. Recall that, by Lemmas 2.2 and 3, and vectors are linearly independent. Also, we can expand so that
where and satisfy
and .
A.1 Testing first-order stationarity (lines 2–4, 7–11 and 16–18)
A.1.1 Test of first-order stationarity for and (lines 2–4)
Lines 2–4 of Algorithm 2 correspond to testing if and . If they are not all zero, the opposite direction is a descent direction, as Line 3 returns. To see why, suppose . Then choose perturbations
If we apply perturbation where , we can immediately check that and . So,
and also that . Then, by scaling sufficiently small we can achieve , which disproves that is a local minimum.
A.1.2 Test of first-order stationarity for and (lines 7–11 and 16–18)
The test for and is more difficult because depends on and when there are boundary data points. Recall that () is the th row of . Then note from Lemma 3 that