Empirical success of deep neural networks has sparked great interest in the theory of deep models. From an optimization viewpoint, the biggest mystery is that deep neural networks are successfully trained by gradient-based algorithms despite their nonconvexity. On the other hand, it has been known that training neural networks to global optimality is NP-hard. It is also known that even checking local optimality of nonconvex problems can be NP-hard . Bridging this gap between theory and practice is a very active area of research.
There have been many attempts to understand why optimization works well for neural networks, by studying the loss surface [1, 27, 10, 24, 17, 18, 21, 12, 28, 29, 30, 26, 23] and the role of (stochastic) gradient-based methods [25, 4, 8]. Besides nonconvexity, for ReLU networks significant additional challenges in the analysis arise due to nondifferentiability, and obtaining a precise understanding of the nondifferentiable points is still elusive.
Nondifferentiable points lie in a set of measure zero, so one may be tempted to overlook them as “non-generic.” However, when studying critical points we cannot do so, as they are precisely such “non-generic” points. Laurent and von Brecht 
study one-hidden-layer ReLU networks with hinge loss and note that except for piecewise constant regions, local minima always occur on nonsmooth boundaries. Probably due to difficulty in analysis, there have not been other works that handle such nonsmooth points of losses and prove results that work forall points. Some theorems [24, 18] hold “almost surely”; some assume differentiability or make statements only for differentiable points [17, 28]; others analyze population risk, in which case the nondifferentiability disappears after taking expectation [25, 4, 8, 21, 26].
1.1 Summary of our results
In this paper, we take a step towards understanding nondifferentiable points of the empirical risk of one-hidden-layer ReLU(-like) networks. Specifically, we provide a theoretical algorithm that tests second-order stationarity (SOS) for any point of the loss surface. It takes an input point and returns:
The point is a local minimum; or
The point is a second-order stationary point (SOSP); or
A descent direction in which the function value strictly decreases.
Therefore, we can test whether a given point is a SOSP. If not, the test extracts a guaranteed direction of descent that helps continue minimization; or even escape from saddle points! What makes it nontrivial is that unlike Hessian based methods for escaping saddles, we do not have differentiability.
The key computational challenge in constructing our algorithm for nondifferentiable points is posed by data points that lie on the “boundary” of a hidden neuron. Since each such data point bisects the parameter space into two halfspaces with different “slopes” of the loss surface, one runs into nondifferentiability. If there aresuch boundary data points, then in the worst case the parameter space divides into regions, so naively testing each region will be very inefficient. In our algorithm, we overcome this issue by a clever use of polyhedral geometry. Another challenge comes from the second-order test, which involves solving nonconvex QPs. Although QP is NP-hard in general , we prove that the QPs in our algorithm are still solved efficiently in most cases. We further describe the challenges and key ideas in Section 2.1.
Many practitioners of deep learning rely on first-order methods, without good termination criteria for the optimization problem. Our algorithm proposes a tool for improvement: with a proper numerical implementation (although we leave numerical implementation to future work), it can test whether a given point is a SOSP, or extract a descent direction using second-order information. One can imagine running a first-order method until it “gets stuck,” then using our algorithm to test SOS or escape from the saddle. This idea of mixing first and second-order methods has been explored in differentiable problems[6, 20, 15].
For a vector, denotes its -th component, and denotes a semi-norm where is a positive semidefinite matrix. Given a matrix , we let , , and be ’s -th entry, the -th row, and the -th column, respectively.
2 Problem setting and key ideas
We consider a one-hidden-layer neural network with input dimension , hidden layer width , and output dimension . We are given pairs of data points and labels , where and . Given an input vector , the output of the network is defined as where , , , and
are the network parameters. The activation functionis “ReLU-like,” meaning , where and . Note that ReLU and Leaky-ReLU are members of this class. In training neural networks, we are interested in minimizing the empirical risk
over the parameters , where
is the loss function. We make the following assumptions on the loss function and the training dataset: The loss functionis twice continuously differentiable and convex in . No
data points lie on the same affine hyperplane. Assumption2 is satisfied by many standard loss functions such as squared error loss and cross-entropy loss. Assumption 2 means, if for example, no three data points are on the same line. Since real-world datasets contain noise, this assumption is also quite mild.
2.1 Challenges and key ideas
In this section, we explain the difficulties at nondifferentiable points and ideas on overcoming them. Our algorithm is built from first principles, rather than advanced tools from nonsmooth analysis.
Bisection by boundary data points.
Since the activation function is nondifferentiable at , the behavior of data points at the “boundary” is decisive. Consider a simple example , so is a row vector. If , then the sign of for any small perturbations and stays invariant. In contrast, when there is a point on the “boundary,” i.e., , then the slope depends on the direction of perturbation, leading to nondifferentiability. We refer to such points as boundary data points. When ,
and similarly, the slope is for . This means that the “gradient” (as well as higher order derivatives) of depends on direction of .
Thus, every boundary data point defines a hyperplane through the origin, and bisects the parameter space into two halfspaces. The situation is even worse if we have boundary data points: they lead to a worst case of regions. Does it mean that we need to test all regions separately? We show that there is a way to remedy this problem, but before that, we first describe how to test local minimality or stationarity for each region.
Second-order local optimality conditions.
We can expand and obtain the following Taylor-like expansion for small enough perturbations (see Lemma 3 for details)
where is a vectorized version of all parameters and is the corresponding vector of perturbations. Notice now that in (1), at nondifferentiable points the usual Taylor expansion does not exist, but the corresponding “gradient” and “Hessian” now depend on the direction of perturbation . Also, the space of is divided into regions, and and are piecewise-constant functions of , constant inside each region. One could view this problem as constrained optimization problems and try to solve for KKT conditions at ; however, we provide an approach that is developed from first principles and solves all problems efficiently.
Given this expansion (1) and the observation that derivatives stay invariant with respect to scaling of , one can note that (a) for all , and (b) for all such that are necessary conditions for local optimality of , thus is a “SOSP” (see Definition 2.2). The conditions become sufficient if (b) is replaced with for all such that . In fact, this is a generalized version of second-order necessary (or sufficient) conditions, i.e., and (or ), for differentiable .
Efficiently testing SOSP for exponentially many regions.
Motivated from the second-order expansion (1) and necessary/sufficient conditions, our algorithm consists of three steps:
Testing first-order stationarity (in the Clarke sense, see Definition 2.2),
Testing for all ,
Testing for .
The tests are executed from Step a to c. Whenever a test fails, we get a strict descent direction , and the algorithm returns and terminates. Below, we briefly outline each step and discuss how we can efficiently perform the tests. We first check first-order stationarity because it makes Step b easier. Step a is done by solving one convex QP per each hidden node. For Step b
, we formulate linear programs (LPs) per eachregion, so that checking whether all LPs have minimum cost of zero is equivalent to checking for all . Here, the feasible sets of LPs are pointed polyhedral cones, whereby it suffices to check only the extreme rays of the cones. It turns out that there are only extreme rays, each shared by cones, so testing can be done with only inequality/equality tests instead of solving exponentially many LPs. In Step b, we also record the flat extreme rays, i.e., those with , for later use in Step c.
In Step c, we test if the second-order perturbation can be negative, for directions where . Due to the constraint , the second-order test requires solving constrained nonconvex QPs. In case where there is no flat extreme ray, we need to solve only one equality constrained QP (ECQP). If there exist flat extreme rays, a few more inequality constrained QPs (ICQPs) are solved. Despite NP-hardness of general QPs , we prove that the specific form of QPs in our algorithm are still tractable in most cases. More specifically, we prove that projected gradient descent on ECQPs converges/diverges exponentially fast, and each step takes time ( is the number of parameters). In case of ICQPs, it takes time to solve the QP, where is the number of boundary data points that have flat extreme rays (). Here, we can see that if is small enough, the ICQP can still be solved in polynomial time in . At the end of the paper, we provide empirical evidences that the number of flat extreme rays is zero or very few, meaning that in most cases we can solve the QP efficiently.
2.2 Problem-specific notation and definition
In this section, we define a more precise notion of generalized stationary points and introduce some additional symbols that will be helpful in streamlining the description of our algorithm in Section 3. Since we are dealing with nondifferentiable points of nonconvex , usual notions of (sub)gradients do not work anymore. Here, Clarke subdifferential is a useful generalization : [FOSP, Theorem 6.2.5 of Borwein and Lewis ] Suppose that a function is locally Lipschitz around the point , and differentiable in where has Lebesgue measure zero. Then the Clarke differential of at is
If , we say is a first-order stationary point (FOSP). From the definition, we can note that Clarke subdifferential is the convex hull of all the possible values of in (1). For parameters , let and be the Clarke differential w.r.t. to and , respectively. They are the projection of onto the space of individual parameters. Whenever the point is clear (e.g. our algorithm), we will omit from . Next, we define second-order stationary points for the empirical risk . Notice that this generalizes the definition of SOSP for differentiable functions : and . [SOSP] We call is a second-order stationary point (SOSP) of if (1) is a FOSP, (2) for all , and (3) for all such that .
Given an input data point , we define to be the output of hidden layer. We note that the notation is overloaded with the big-Oh notation, but their meaning will be clear from the context. Consider perturbing parameters with , then the perturbed output of the network and the amount of perturbation can be expressed as
where can be thought informally as the “Jacobian” matrix of the hidden layer. The matrix is diagonal, and its -th diagonal entry is given by
where is the derivative of . We define , which is okay because it is always multiplied with zero in our algorithm. For boundary data points, depends on the direction of perturbations , as noted in Section 2.1. We additionally define and to separate the terms in that are linear in perturbations versus quadratic in perturbations.
For simplicity of notation for the rest of the paper, we define for all ,
In our algorithm and its analysis, we need to give a special treatment to the boundary data points. To this end, for each node in the hidden layer, define boundary index set as
The subspace spanned by vectors for in plays an important role in our tests; so let us define a symbol for it, as well as the cardinality of and their sum:
For , let be the -th row of , and be the -th column of . Next, we define the total number of parameters , and vectorized perturbations :
Also let be vectorized parameters , packed in the same order as .
Define a matrix . This quantity appears multiplie times and does not depend on the perturbation, so it is helpful to have a simple symbol for it.
3 Test algorithm for second-order stationarity
In this section, we present SOSP-Check in Algorithm 1, which takes an arbitrary tuple of parameters as input and checks whether it is a SOSP. We first present a lemma that shows the explicit form of the perturbed empirical risk and identify first and second-order perturbations. The proof is deferred to Appendix B.2. For small enough perturbation ,
where and satisfy
and . Also, and are piece-wise constant functions of , which are constant inside each polyhedral cone in space of .
Rough pseudocode of SOSP-Check is presented in Algorithm 1. As described in Section 2.1, the algorithm consists of three steps: (a) testing first-order stationarity (b) testing for all , and (c) testing for . If the input point satisfies the second-order sufficient conditions for local minimality, the algorithm decides it is a local minimum. If the point only satisfies second-order necessary conditions, it returns SOSP. If a strict descent direction is found, the algorithm terminates immediately and returns . A brief description will follow, but the full algorithm (Algorithm 2) and a full proof of correctness are deferred to Appendix A.
Test for and is more difficult because depends on and when there are boundary data points. For each , Line 5 (if ), and Line 9 (if ) test if is in . Note from Definition 2.2 and Lemma 3 that is the convex hull of all possible values of . If , can be tested by solving a convex QP:
Linear program formulation.
Every bisects into two halfspaces, and , in each of which stays constant. Note that by Lemma 2.2, ’s for are linearly independent. So, given boundary data points, they divide the space of into polyhedral cones.
Since is constant in each polyhedral cones, we can let for all , and define an LP for each :
Solving these LPs and checking if the minimum value is suffices to prove for all small enough perturbations. The constraint is there because any is also orthogonal to . It is equivalent to linearly independent equality constraints. So, the feasible set of LP (3) has linearly independent constraints, which implies that the feasible set is a pointed polyhedral cone with vertex at origin. Since any point in a pointed polyhedral cone is a conical combination (linear combination with nonnegative coefficients) of extreme rays of the cone, checking nonnegativity of the objective function for all extreme rays suffices. We emphasize that we do not solve the LPs (3) in our algorithm; we just check the extreme rays.
Extreme rays of a pointed polyhedral cone in are computed from linearly independent active constraints. For each , the extreme ray must be tested whether , in both directions. Note that there are extreme rays, and one extreme ray is shared by polyhedral cones. Moreover, for , which indicates that
regardless of . Testing an extreme ray can be done with a single inequality test instead of separate tests for all cones! Thus, this extreme ray approach instead of solving individual LPs greatly reduces computation, from to .
Testing extreme rays.
For the details of testing all possible extreme rays, please refer to FO-Increasing-Test (Algorithm 4) and Appendix A.2. FO-Increasing-Test computes all possible extreme rays and tests if they satisfy . If the inequality is not satisfied by an extreme ray , then this is a descent direction, so we return . If the inequality holds with equality, it means this is a flat extreme ray, and it needs to be checked in second-order test, so we save this extreme ray for future use.
How many flat extreme rays () are there? Presence of flat extreme rays introduce inequality constraints in the QP that we solve in the second-order test. It is ideal not to have them, because in this case there are only equality constraints, so the QP is easier to solve. Lemma A.2 in Appendix A.2 shows the conditions for having flat extreme rays; in short, there is a flat extreme ray if or or . For more details, please refer to Appendix A.2.
The second-order test checks for “flat” ’s satisfying . This is done with help of the function SO-Test (Algorithm 5). Given its input , it defines fixed “Jacobian” matrices for all data points and equality/inequality constraints for boundary data points, and solves the QP of the following form:
Constraints and number of QPs.
There are equality constraints of the form . These equality constraints are due to the nonnegative homogeneous property of activation ; i.e., scaling and by and scaling by yields exactly the same network. So, these equality constraints force to be orthogonal to the loss-invariant directions. This observation is stated more formally in Lemma A.3, which as a corollary shows that any differentiable FOSP of always has rank-deficient Hessian. The other constraints make sure that the union of feasible sets of QPs is exactly (please see Lemma A.3 in Appendix A.3 for details). It is also easy to check that these constraints are all linearly independent.
If there is no flat extreme ray, the algorithm solves just one QP with equality constraints. If there are flat extreme rays, the algorithm solves one QP with equality constraints, and more QPs with equality constraints and inequality constraints, where
Efficiency of solving the QPs (4).
Despite NP-hardness of general QPs, our specific form of QPs (4) can be solved quite efficiently, avoiding exponential complexity in . After solving QP (4), there are three (disjoint) termination conditions:
whenever , or
whenever , but such that , or
such that and ,
where is the feasible set of QP. With the following two lemmas, we show that the termination conditions can be efficiently tested for ECQPs and ICQPs. First, the ECQPs can be iteratively solved with projected gradient descent, as stated in the next lemma. Consider the QP, where is symmetric and has full row rank:
Then, projected gradient descent (PGD) updates
with learning rate converges to a solution or diverges to infinity exponentially fast. Moreover, with random initialization, PGD correctly checks conditions a–c with probability . The proof is an extension of unconstrained case , and is deferred to Appendix B.3. Note that it takes time to compute in the beginning, and each update takes time. It is also surprising that the convergence rate does not depend on .
In the presence of flat extreme rays, we have to solve QPs involving inequality constraints. We prove that our ICQP can be solved in time, which implies that as long as the number of flat extreme rays is small, the problem can still be solved in polynomial time in . Consider the QP, where is symmetric, and have full row rank, and has rank :
Then, there exists a method that checks whether a–c in time. In short, we transform to define an equivalent problem, and use classical results in copositive matrices [14, 22, 9]; the problem can be solved by computing the eigensystem of a matrix, and testing copositivity of an matrix. The proof is presented in Appendix B.4.
Concluding the test.
During all calls to SO-Test, whenever any QP terminated with c, then SOSP-Check immediately returns the direction and terminates. After solving all QPs, if any of SO-Test calls finished with b, then we conclude SOSP-Check with “SOSP.” If all QPs terminated with a, then we can return “Local Minimum.”
For experiments, we used artificial datasets sampled iid from standard normal distribution, and trained 1-hidden-layer ReLU networks with squared error loss. In practice, it is impossible to get to the exact nondifferentiable point, because they lie in a set of measure zero. To get close to those points, we ran Adam using full-batch (exact) gradient for 200,000 iterations and decaying step size (start with , decay every 20,000 iterations). We observed that decaying step size had the effect of “descending deeper into the valley.”
After running Adam, for each , we counted the number of approximate boundary data points satisfying
, which gives an estimate of. Moreover, for these points, we solved the QP (2) using L-BFGS-B , to check if the terminated points are indeed (approximate) FOSPs. We could see that the optimal values of (2) are close to zero ( typically, for largest problems). After solving (2), we counted the number of ’s that ended up with or . The number of such ’s is an estimate of . We also counted the number of approximate boundary data points satisfying , for an estimate of .
We ran the above-mentioned experiments for different settings of , 40 times each. We fixed for simplicity. For large , the optimizer converged to near-zero minima, making uniformly small, so it was difficult to obtain accurate estimates of and . Thus, we had to perform experiments in settings where the optimizer converged to minima that are far from zero.
|# Runs||Sum (Avg.)||Sum (Avg.)||Sum (Avg.)|
|40||290 (7.25)||0 (0)||0 (0)||0|
|40||371 (9.275)||1 (0.025)||0 (0)||0.025|
|40||1,452 (36.3)||0 (0)||0 (0)||0|
|40||2,976 (74.4)||2 (0.05)||0 (0)||0.05|
|40||24,805 (620.125)||4 (0.1)||0 (0)||0.1|
|40||14,194 (354.85)||0 (0)||0 (0)||0|
|40||42,334 (1,058.35)||37 (0.925)||1 (0.025)||0.625|
Table 1 summarizes the results. Through 280 runs, we observed that there are surprisingly many boundary data points () in general, but usually there are zero or very few (maximum was 3) flat extreme rays (). This observation suggests two important messages: (1) many local minima are on nondifferentiable points, which is the reason why our analysis is meaningful; (2) luckily, is usually very small, so we only need to solve ECQPs () or ICQPs with very small number of inequality constraints, which are solved efficiently (Lemmas 3.3 and 3.3). We can observe that , , and indeed increase as model dimensions and training set get larger, but the rate of increase is not as fast as , , and .
For our experiments, we used artificial regression datasets instead of real ones. This is because popular real datasets are classification datasets, but cross-entropy loss gets arbitrarily small as parameters tend to infinity and hinge loss is not differentiable; the losses are not adequate for our purposes.
5 Discussion and Future Work
We provided an algorithm to test second-order stationarity and escape saddle points, for nondifferentiable points of empirical risk of shallow ReLU-like networks. Despite difficulty raised by boundary data points dividing the parameter space into regions, we reduced the computation to convex QPs, equality/inequality tests, and one (or a few more) QP. In benign cases, the last QP is equality constrained, which can be efficiently solved with projected gradient descent. In the worst case, the QP has a few (say ) inequality constraints, but it can be solved efficiently when is small. We also provided empirical evidences that is usually either zero or very small, suggesting that the test can be done efficiently in most cases. Extending this test to deeper neural networks is a possible future work. Also, numerical implementation of this algorithm and combining it with practical gradient-based methods will be of great interest.
Baldi and Hornik 
P. Baldi and K. Hornik.
Neural networks and principal component analysis: Learning from examples without local minima.Neural networks, 2(1):53–58, 1989.
- Blum and Rivest  A. Blum and R. L. Rivest. Training a 3-node neural network is np-complete. In Proceedings of the 1st International Conference on Neural Information Processing Systems, pages 494–501. MIT Press, 1988.
- Borwein and Lewis  J. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.
Brutzkus and Globerson 
A. Brutzkus and A. Globerson.
Globally optimal gradient descent for a convnet with gaussian inputs.
International Conference on Machine Learning, pages 605–614, 2017.
- Byrd et al.  R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
- Carmon et al.  Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
- Clarke et al.  F. H. Clarke, Y. S. Ledyaev, R. J. Stern, and P. R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
- Du et al.  S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
- Hiriart-Urruty and Seeger  J.-B. Hiriart-Urruty and A. Seeger. A variational approach to copositive matrices. SIAM review, 52(4):593–629, 2010.
- Kawaguchi  K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Laurent and von Brecht  T. Laurent and J. von Brecht. The multilinear structure of relu networks. arXiv preprint arXiv:1712.10132, 2017.
- Lee et al.  J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
- Martin and Jacobson  D. H. Martin and D. H. Jacobson. Copositive matrices and definiteness of quadratic forms subject to homogeneous linear inequality constraints. Linear Algebra and its Applications, 35:227–258, 1981.
- Mokhtari et al.  A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. arXiv preprint arXiv:1809.02162, 2018.
- Murty and Kabadi  K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.
- Nguyen and Hein [2017a] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2603–2612, 2017a.
- Nguyen and Hein [2017b] Q. Nguyen and M. Hein. Optimization landscape and expressivity of deep cnns. arXiv preprint arXiv:1710.10928, 2017b.
Pardalos and Vavasis 
P. M. Pardalos and S. A. Vavasis.
Quadratic programming with one negative eigenvalue is np-hard.Journal of Global Optimization, 1(1):15–22, 1991.
- Reddi et al.  S. J. Reddi, M. Zaheer, S. Sra, B. Poczos, F. Bach, R. Salakhutdinov, and A. J. Smola. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017.
- Safran and Shamir  I. Safran and O. Shamir. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.
- Seeger  A. Seeger. Eigenvalue analysis of equilibrium processes defined by linear complementarity conditions. Linear Algebra and its Applications, 292(1-3):1–14, 1999.
- Shamir  O. Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
- Soudry and Carmon  D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Tian  Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413, 2017.
- Wu et al.  C. Wu, J. Luo, and J. D. Lee. No spurious local minima in a two hidden unit relu network. In International Conference on Learning Representations Workshop, 2018.
Yu and Chen 
X.-H. Yu and G.-A. Chen.
On the local minima free condition of backpropagation learning.IEEE Transactions on Neural Networks, 6(5):1300–1303, 1995.
- Yun et al. [2018a] C. Yun, S. Sra, and A. Jadbabaie. Spurious local minima in neural networks: a critical view. arXiv preprint arXiv:1802.03487, 2018a.
- Yun et al. [2018b] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. In International Conference on Learning Representations, 2018b.
- Zhou and Liang  Y. Zhou and Y. Liang. Critical points of neural networks: Analytical forms and landscape properties. In International Conference on Learning Representations, 2018.
Appendix A Full algorithms and proof of correctness
where and satisfy
If we apply perturbation where , we can immediately check that and . So,
and also that . Then, by scaling sufficiently small we can achieve , which disproves that is a local minimum.
Test for and is more difficult because depends on and when there are boundary data points. Recall that () is the -th row of . Then note from Lemma 3 that