Efficiently testing local optimality and escaping saddles for ReLU networks

09/28/2018 ∙ by Chulhee Yun, et al. ∙ MIT

We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiable boundary of a ReLU divides the parameter space into at most 2^M regions, which makes the analysis difficult. By exploiting polyhedral geometry, we reduce the total computation to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QPs. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality-constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either the benign case or the bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.


1 Introduction

Empirical success of deep neural networks has sparked great interest in the theory of deep models. From an optimization viewpoint, the biggest mystery is that deep neural networks are successfully trained by gradient-based algorithms despite their nonconvexity. On the other hand, it has been known that training neural networks to global optimality is NP-hard [2]. It is also known that even checking local optimality of nonconvex problems can be NP-hard [16]. Bridging this gap between theory and practice is a very active area of research.

There have been many attempts to understand why optimization works well for neural networks, by studying the loss surface [1, 27, 10, 24, 17, 18, 21, 12, 28, 29, 30, 26, 23] and the role of (stochastic) gradient-based methods [25, 4, 8]. Besides nonconvexity, for ReLU networks significant additional challenges in the analysis arise due to nondifferentiability, and obtaining a precise understanding of the nondifferentiable points is still elusive.

Nondifferentiable points lie in a set of measure zero, so one may be tempted to overlook them as “non-generic.” However, when studying critical points we cannot do so, as they are precisely such “non-generic” points. Laurent and von Brecht [12] study one-hidden-layer ReLU networks with hinge loss and note that, except for piecewise constant regions, local minima always occur on nonsmooth boundaries. Probably due to the difficulty of the analysis, there have been no other works that handle such nonsmooth points of losses and prove results that hold for all points. Some theorems [24, 18] hold “almost surely”; some assume differentiability or make statements only for differentiable points [17, 28]; others analyze the population risk, in which case the nondifferentiability disappears after taking an expectation [25, 4, 8, 21, 26].

1.1 Summary of our results

In this paper, we take a step towards understanding nondifferentiable points of the empirical risk of one-hidden-layer ReLU(-like) networks. Specifically, we provide a theoretical algorithm that tests second-order stationarity (SOS) for any point of the loss surface. It takes an input point and returns:

  1. The point is a local minimum; or

  2. The point is a second-order stationary point (SOSP); or

  3. A descent direction in which the function value strictly decreases.

Therefore, we can test whether a given point is a SOSP. If not, the test extracts a guaranteed descent direction that helps continue minimization, or even escape from saddle points. What makes this nontrivial is that, unlike Hessian-based methods for escaping saddles, we do not have differentiability.

The key computational challenge in constructing our algorithm for nondifferentiable points is posed by data points that lie on the “boundary” of a hidden neuron. Since each such data point bisects the parameter space into two halfspaces with different “slopes” of the loss surface, one runs into nondifferentiability. If there are M such boundary data points, then in the worst case the parameter space divides into 2^M regions, so naively testing each region separately would be very inefficient. In our algorithm, we overcome this issue by a clever use of polyhedral geometry. Another challenge comes from the second-order test, which involves solving nonconvex QPs. Although QP is NP-hard in general [19], we prove that the QPs in our algorithm can still be solved efficiently in most cases. We further describe the challenges and key ideas in Section 2.1.

Remarks.

Many practitioners of deep learning rely on first-order methods, without good termination criteria for the optimization problem. Our algorithm provides a tool for improvement: with a proper numerical implementation (which we leave to future work), it can test whether a given point is a SOSP, or extract a descent direction using second-order information. One can imagine running a first-order method until it “gets stuck,” then using our algorithm to test second-order stationarity or to escape from the saddle. This idea of mixing first- and second-order methods has been explored for differentiable problems [6, 20, 15].

Notation.

For a vector v, v_i denotes its i-th component, and ‖v‖_A denotes the semi-norm (v^T A v)^{1/2}, where A is a positive semidefinite matrix. Given a matrix B, we let B_{i,j}, B_{i,·}, and B_{·,j} denote B's (i,j)-th entry, i-th row, and j-th column, respectively.

2 Problem setting and key ideas

We consider a one-hidden-layer neural network with a given input dimension, hidden layer width, and output dimension. We are given pairs of data points and labels. Given an input vector, the output of the network is computed by an affine map into the hidden layer, an entrywise activation, and a second affine map to the output; the weight matrices and bias vectors of these two affine maps are the network parameters. The activation function is “ReLU-like,” meaning that it is piecewise linear with two slopes: a positive slope on the nonnegative part and a different, nonnegative slope on the negative part. Note that ReLU and Leaky-ReLU are members of this class. In training neural networks, we are interested in minimizing the empirical risk over the parameters, i.e., the total loss incurred by the network over the training data. We make the following assumptions on the loss function and the training dataset:

Assumption 1. The loss function is twice continuously differentiable and convex in the network output.

Assumption 2. No set of data points of size exceeding the input dimension lies on the same affine hyperplane.

Assumption 1 is satisfied by many standard loss functions such as the squared error loss and the cross-entropy loss. Assumption 2 means, for example, that when the input dimension is two, no three data points are on the same line. Since real-world datasets contain noise, this assumption is also quite mild.
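To make this setup concrete, here is a minimal NumPy sketch of such a network and its empirical risk under squared error loss; the names (W1, b1, W2, b2, s_plus, s_minus) are illustrative choices of ours, not notation fixed by the paper.

```python
import numpy as np

def relu_like(z, s_plus=1.0, s_minus=0.0):
    """ReLU-like activation: slope s_plus on the nonnegative part and slope
    s_minus on the negative part (s_minus = 0 gives ReLU, a small positive
    s_minus gives Leaky-ReLU)."""
    return np.where(z >= 0, s_plus * z, s_minus * z)

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network: affine map, entrywise activation, affine map."""
    return W2 @ relu_like(W1 @ x + b1) + b2

def empirical_risk(W1, b1, W2, b2, X, Y):
    """Average squared-error loss over the data points (columns of X and Y)."""
    residuals = W2 @ relu_like(W1 @ X + b1[:, None]) + b2[:, None] - Y
    return 0.5 * np.mean(np.sum(residuals ** 2, axis=0))
```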

2.1 Challenges and key ideas

In this section, we explain the difficulties that arise at nondifferentiable points and our ideas for overcoming them. Our algorithm is built from first principles, rather than from advanced tools of nonsmooth analysis.

Bisection by boundary data points.

Since the activation function is nondifferentiable at zero, the behavior of data points at the “boundary” — those whose input to a hidden neuron is exactly zero — is decisive. For a data point whose pre-activation is strictly away from zero, the sign of the neuron's input stays invariant under any sufficiently small perturbation of the parameters, so the loss is locally smooth with respect to that data point. In contrast, when a data point lies exactly on the boundary, the slope of the loss depends on the direction of the perturbation, leading to nondifferentiability; we refer to such points as boundary data points. For a boundary data point, perturbations that push the neuron's input to the positive side see one slope of the activation, while perturbations that push it to the negative side see the other, so the “gradient” (as well as higher-order derivatives) of the loss depends on the direction of perturbation.

Thus, every boundary data point defines a hyperplane through the origin and bisects the parameter space into two halfspaces. The situation is even worse if we have M boundary data points: they lead to a worst case of 2^M regions. Does this mean that we need to test all the regions separately? We show that there is a way to remedy this problem, but before that, we first describe how to test local minimality or stationarity within each region.
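As a concrete illustration of this bisection, the sketch below collects the boundary data points of each hidden node by checking whether the pre-activation is (numerically) zero; the tolerance and all names are illustrative choices of ours.

```python
import numpy as np

def boundary_index_sets(W1, b1, X, tol=1e-8):
    """For each hidden node j, return the indices of data points whose
    pre-activation w_j^T x_i + b_j is (numerically) zero, i.e. the data points
    sitting exactly on that node's boundary hyperplane."""
    pre_act = W1 @ X + b1[:, None]          # shape: (hidden_width, num_points)
    return [np.flatnonzero(np.abs(pre_act[j]) <= tol)
            for j in range(W1.shape[0])]

# With M boundary data points in total, the perturbation space splits into up
# to 2^M polyhedral cones, one per sign pattern of the perturbed
# pre-activations -- the source of the worst-case exponential blow-up.
```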

Second-order local optimality conditions.

We can expand the empirical risk and obtain the following Taylor-like expansion for small enough perturbations (see Lemma 3 for details):

risk(parameters + perturbation) = risk(parameters) + (first-order term) + (second-order term) + (higher-order remainder),   (1)

where the parameters are treated as one long vector and the perturbation is the corresponding vector of parameter perturbations. Notice that in (1), at nondifferentiable points the usual Taylor expansion does not exist, but the corresponding “gradient” and “Hessian” defining the first- and second-order terms now depend on the direction of perturbation. Also, the space of perturbations is divided into regions, and the “gradient” and “Hessian” are piecewise-constant functions of the perturbation, constant inside each region. One could view this as exponentially many constrained optimization problems (one per region) and try to check the KKT conditions in each; however, we provide an approach that is developed from first principles and solves all of them efficiently.

Given this expansion (1) and the observation that the derivatives stay invariant under scaling of the perturbation, one can note that the following are necessary conditions for local optimality: (a) the first-order term is nonnegative for every perturbation direction, and (b) the second-order term is nonnegative for every direction in which the first-order term is zero; a point satisfying them is a “SOSP” (see Definition 2.2). The conditions become sufficient if (b) is replaced by strict positivity of the second-order term for every nonzero direction in which the first-order term is zero. In fact, this is a generalized version of the second-order necessary (or sufficient) conditions for differentiable functions, namely a zero gradient and a positive semidefinite (or positive definite) Hessian.
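For readers who prefer symbols, the following is a sketch of the expansion and of the conditions just stated, written in notation chosen here for illustration (θ for the vectorized parameters, η for the perturbation, g(η) and H(η) for the direction-dependent “gradient” and “Hessian”); this paraphrases (1) and the surrounding discussion rather than reproducing the exact formulas.

```latex
% Illustrative notation: \theta = vectorized parameters, \eta = perturbation,
% g(\eta), H(\eta) = direction-dependent "gradient" and "Hessian"
% (piecewise constant in \eta, constant inside each polyhedral cone).
\[
  f(\theta + \eta) \;=\; f(\theta) \;+\; g(\eta)^\top \eta
    \;+\; \tfrac{1}{2}\,\eta^\top H(\eta)\,\eta \;+\; o(\|\eta\|^2).
\]
% Necessary conditions for local optimality (an SOSP):
\[
  g(\eta)^\top \eta \;\ge\; 0 \quad \forall \eta,
  \qquad
  \eta^\top H(\eta)\,\eta \;\ge\; 0 \quad \forall \eta \text{ with } g(\eta)^\top \eta = 0.
\]
% Sufficient conditions: strict inequality in the second condition for all
% \eta \neq 0 with g(\eta)^\top \eta = 0.  For differentiable f these reduce to
% \nabla f(\theta) = 0 and \nabla^2 f(\theta) \succeq 0 (resp. \succ 0).
```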

Efficiently testing SOSP for exponentially many regions.

Motivated by the second-order expansion (1) and the necessary/sufficient conditions above, our algorithm consists of three steps:

  (a) Testing first-order stationarity (in the Clarke sense, see Definition 2.2),

  (b) Testing whether the first-order term is nonnegative in every perturbation direction,

  (c) Testing whether the second-order term is nonnegative in every direction with zero first-order term.

The tests are executed from Step (a) to Step (c). Whenever a test fails, we get a strict descent direction, which the algorithm returns before terminating. Below, we briefly outline each step and discuss how the tests can be performed efficiently. We first check first-order stationarity because it makes Step (b) easier. Step (a) is done by solving one convex QP per hidden node. For Step (b), we formulate one linear program (LP) per region, so that checking whether every LP has minimum cost zero is equivalent to checking that the first-order term is nonnegative in every direction. Here, the feasible sets of the LPs are pointed polyhedral cones, whereby it suffices to check only the extreme rays of the cones. It turns out that the number of extreme rays is only linear in the number of boundary data points, and each ray is shared by many cones, so the test can be done with only O(M) inequality/equality tests instead of solving exponentially many LPs. In Step (b), we also record the flat extreme rays, i.e., those along which the first-order term is exactly zero, for later use in Step (c).

In Step (c), we test whether the second-order term can be negative along directions in which the first-order term is zero. Because of the constraints describing these directions, the second-order test requires solving constrained nonconvex QPs. If there is no flat extreme ray, we need to solve only one equality constrained QP (ECQP). If there exist flat extreme rays, a few more inequality constrained QPs (ICQPs) are solved. Despite the NP-hardness of general QPs [19], we prove that the specific form of the QPs in our algorithm can still be handled efficiently in most cases. More specifically, we prove that projected gradient descent on the ECQPs converges or diverges exponentially fast, with cheap per-step cost in the number of parameters. For the ICQPs, the running time is exponential only in the number of boundary data points that have flat extreme rays, so if this number is small enough, the ICQP can still be solved in time polynomial in the number of parameters. At the end of the paper, we provide empirical evidence that the number of flat extreme rays is zero or very small, meaning that in most cases we can solve the QPs efficiently.

2.2 Problem-specific notation and definition

In this section, we define a more precise notion of generalized stationary points and introduce some additional symbols that will be helpful in streamlining the description of our algorithm in Section 3. Since we are dealing with nondifferentiable points of a nonconvex risk, the usual notions of (sub)gradients no longer apply. Here, the Clarke subdifferential is a useful generalization [7]:

[FOSP, Theorem 6.2.5 of Borwein and Lewis [3]] Suppose that a function is locally Lipschitz around a point and differentiable everywhere except on a set of Lebesgue measure zero. Then the Clarke subdifferential of the function at that point is the convex hull of all limits of gradients taken along sequences of differentiable points converging to that point. If the Clarke subdifferential contains zero, we say the point is a first-order stationary point (FOSP). From the definition, we can note that the Clarke subdifferential is the convex hull of all possible values of the direction-dependent “gradient” in (1). We also consider the Clarke subdifferentials with respect to the individual parameter blocks; they are the projections of the full Clarke subdifferential onto the spaces of the individual parameters. Whenever the point is clear from context (e.g., in our algorithm), we omit it from the notation. Next, we define second-order stationary points for the empirical risk. Notice that this generalizes the definition of SOSP for differentiable functions, namely a zero gradient and a positive semidefinite Hessian. [SOSP] We call a point a second-order stationary point (SOSP) if (1) it is a FOSP, (2) the first-order term in (1) is nonnegative for every perturbation direction, and (3) the second-order term is nonnegative for every direction in which the first-order term is zero.
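For reference, here is the standard formula behind this definition (Theorem 6.2.5 of Borwein and Lewis [3]), written with generic symbols of our choosing:

```latex
% f locally Lipschitz around \bar{x}, differentiable outside a set S of
% Lebesgue measure zero.  The Clarke subdifferential of f at \bar{x} is
\[
  \partial f(\bar{x}) \;=\;
    \operatorname{conv}\Bigl\{ \lim_{k \to \infty} \nabla f(x_k)
      \;:\; x_k \to \bar{x},\; x_k \notin S \Bigr\},
\]
% and \bar{x} is a first-order stationary point (FOSP) iff 0 \in \partial f(\bar{x}).
```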

Given an input data point, we define the output of the hidden layer as an auxiliary quantity. (This symbol is overloaded with the big-O notation, but its meaning will always be clear from context.) Consider perturbing the parameters; then the perturbed output of the network, and the amount by which the output changes, can be expressed in terms of a diagonal matrix that can informally be thought of as the “Jacobian” of the hidden layer: its diagonal entries are the derivatives of the activation function evaluated at the (perturbed) pre-activations. The derivative of the activation at zero is fixed to an arbitrary value, which is harmless because in our algorithm it is always multiplied by zero. For boundary data points, this “Jacobian” depends on the direction of the perturbation, as noted in Section 2.1. We additionally define two versions of this matrix, to separate the terms in the output perturbation that are linear in the perturbation from those that are quadratic in it.
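The sketch below illustrates the direction dependence of these diagonal entries for a ReLU-like activation: away from the boundary, the slope is fixed by the sign of the pre-activation; at a boundary entry, it is fixed by the sign of the perturbation. The names (pre_act, delta, s_plus, s_minus) and the handling of an exactly-zero perturbed value are illustrative assumptions.

```python
import numpy as np

def activation_slopes(pre_act, delta, s_plus=1.0, s_minus=0.0, tol=1e-12):
    """Diagonal of the hidden-layer 'Jacobian' for one data point.

    pre_act : pre-activation values w_j^T x + b_j at the current parameters
    delta   : change in the pre-activations induced by a parameter perturbation

    Away from the boundary (|pre_act| > tol) the slope is fixed by the sign of
    pre_act.  At a boundary entry (pre_act ~ 0) the slope depends on the sign
    of the perturbation delta -- this is exactly the direction dependence.
    """
    perturbed = np.where(np.abs(pre_act) > tol, pre_act, delta)
    # the slope at an exactly-zero value is arbitrary; it is multiplied by zero
    return np.where(perturbed >= 0, s_plus, s_minus)
```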

For simplicity of notation in the rest of the paper, we also define, for every data point, shorthand symbols for the quantities that recur in the analysis.

In our algorithm and its analysis, we need to give special treatment to the boundary data points. To this end, for each node in the hidden layer, we define its boundary index set: the set of indices of data points whose pre-activation at that node is exactly zero.

The subspace spanned by the data point vectors in a node's boundary index set plays an important role in our tests, so we introduce a symbol for it, as well as for the cardinality of each boundary index set and for the sum of these cardinalities over all hidden nodes.

We index rows and columns of the weight matrices with the usual subscript notation. Next, we define the total number of parameters and the vectorized perturbation, obtained by stacking the perturbations of all parameters; the vectorized parameters are packed in the same order.

Finally, we define one more matrix shorthand. This quantity appears multiple times and does not depend on the perturbation, so it is helpful to have a simple symbol for it.

We conclude this section by presenting one implication of Assumption 2 in the following lemma, which we will use later; the proof is simple and is presented in Appendix B.1. If Assumption 2 holds, then each hidden node has only a bounded number of boundary data points, and the corresponding data point vectors are linearly independent.

3 Test algorithm for second-order stationarity

In this section, we present SOSP-Check in Algorithm 1, which takes an arbitrary tuple of parameters as input and checks whether it is a SOSP. We first present a lemma that gives the explicit form of the perturbed empirical risk and identifies the first- and second-order perturbation terms; the proof is deferred to Appendix B.2. For a small enough perturbation, the risk admits an expansion of the form (1), in which the direction-dependent “gradient” and “Hessian” have explicit expressions. Moreover, they are piecewise constant functions of the perturbation, constant inside each polyhedral cone of the perturbation space.

Rough pseudocode of SOSP-Check is presented in Algorithm 1. As described in Section 2.1, the algorithm consists of three steps: (a) testing first-order stationarity, (b) testing nonnegativity of the first-order term for all perturbation directions, and (c) testing nonnegativity of the second-order term for all directions with zero first-order term. If the input point satisfies the second-order sufficient conditions for local minimality, the algorithm declares it a local minimum. If the point only satisfies the second-order necessary conditions, it returns SOSP. If a strict descent direction is found, the algorithm terminates immediately and returns that direction. A brief description follows; the full algorithm (Algorithm 2) and a full proof of correctness are deferred to Appendix A.

1:Input: A tuple of parameters.
2:Test if the gradients with respect to the output-layer parameters are zero.
3:for each hidden node do
4:     if the node has boundary data points then
5:          Test if zero is in the Clarke subdifferential w.r.t. the node's parameters.
6:          Test if the first-order term is nonnegative for all perturbation directions via testing extreme rays of polyhedral cones.
7:          Store the flat extreme rays (those with zero first-order term) for the second-order test.
8:     else
9:          Test if the gradient w.r.t. the node's parameters is zero.
10:     end if
11:end for
12:For all directions with zero first-order term, test if the second-order term is nonnegative.
13:if some tested direction attains exactly zero in the second-order test then
14:     return SOSP.
15:else
16:     return Local Minimum.
17:end if
Algorithm 1 SOSP-Check (Rough pseudocode)

3.1 Testing first-order stationarity (lines 2, 5, and 9)

Line 2 of Algorithm 1 corresponds to testing whether the Clarke subdifferentials with respect to the output-layer parameters are the singleton {0}. If not, the opposite of the (unique) gradient is a descent direction. More details are in Appendix A.1.1.

The test for the hidden-layer parameters is more difficult, because the Clarke subdifferential is no longer a singleton when there are boundary data points. For each hidden node, Line 5 (if the node has boundary data points) and Line 9 (if it has none) test whether zero is in the Clarke subdifferential with respect to that node's parameters. Note from Definition 2.2 and Lemma 3 that this subdifferential is the convex hull of all possible values of the direction-dependent “gradient.” When boundary data points are present, membership of zero in this convex set can be tested by solving a convex QP:

(2)

If the solution does not achieve zero objective value, then we can directly return a descent direction. For details please refer to FO-Subdiff-Zero-Test (Algorithm 3) and Appendix A.1.2.
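As an illustration of this kind of membership test (not the exact parameterization used in QP (2)), the sketch below checks whether zero lies in the convex hull of a finite set of candidate “gradient” vectors by minimizing the squared norm of a convex combination; the SciPy-based formulation and all names are our own, and the returned vector is only a candidate descent direction.

```python
import numpy as np
from scipy.optimize import minimize

def zero_in_convex_hull(G, tol=1e-9):
    """Test whether 0 is in conv{columns of G} by solving a convex QP:
        min_lambda ||G @ lambda||^2   s.t.  lambda >= 0,  sum(lambda) = 1.
    Returns (True, None) if the minimum is ~0, else (False, candidate_direction),
    where the negation of the minimum-norm element of the hull serves as a
    candidate descent direction."""
    d, k = G.shape
    lam0 = np.full(k, 1.0 / k)
    obj = lambda lam: float(np.sum((G @ lam) ** 2))
    res = minimize(obj, lam0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    g_star = G @ res.x
    if np.sum(g_star ** 2) <= tol:
        return True, None
    return False, -g_star

# Example: two candidate 'gradients' whose convex hull does not contain 0.
G = np.array([[1.0, 2.0], [0.5, -1.0]])
print(zero_in_convex_hull(G))
```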

3.2 Testing nonnegativity of the first-order term (Lines 6–7)

Linear program formulation.

Lines 6–7 test whether the first-order term is nonnegative for all directions of perturbation of a given node's parameters. Using the solution of QP (2), the first-order term can be written explicitly as a function of the perturbation. Every boundary data point bisects the perturbation space into two halfspaces, in each of which the direction-dependent “gradient” stays constant. Note that, by Lemma 2.2, the data point vectors of the boundary data points are linearly independent. So, the boundary data points of the node divide its perturbation space into exponentially many polyhedral cones, one per sign pattern.

Since the “gradient” is constant inside each polyhedral cone, we can fix its value per cone and define an LP for each cone:

(3)

Solving these LPs and checking that every minimum value is zero suffices to prove nonnegativity of the first-order term for all small enough perturbations. The LP also includes an additional constraint, which is equivalent to a set of linearly independent equality constraints. Together with the sign constraints, the feasible set of LP (3) therefore has enough linearly independent constraints to be a pointed polyhedral cone with its vertex at the origin. Since any point of a pointed polyhedral cone is a conical combination (a linear combination with nonnegative coefficients) of the cone's extreme rays, checking nonnegativity of the objective on all extreme rays suffices. We emphasize that we do not actually solve the LPs (3) in our algorithm; we only check the extreme rays.

Computational efficiency.

Extreme rays of a pointed polyhedral cone are determined by taking all but one of its linearly independent constraints to be active. For each boundary data point, the corresponding candidate extreme ray must be tested in both directions, so there are only two candidate rays per boundary data point, and each extreme ray is shared by half of the polyhedral cones. Moreover, the value of the first-order term along an extreme ray is the same for every cone containing that ray, because the “gradients” of those cones differ only in directions orthogonal to the ray. Testing an extreme ray can therefore be done with a single inequality test instead of separate tests for all cones! Thus, this extreme-ray approach, instead of solving the individual LPs, greatly reduces the computation: from exponentially many LPs to a number of inequality tests that is linear in the number of boundary data points.

Testing extreme rays.

For the details of testing all possible extreme rays, please refer to FO-Increasing-Test (Algorithm 4) and Appendix A.2. FO-Increasing-Test computes all possible extreme rays and tests whether the first-order term is nonnegative along each of them. If the inequality fails for some extreme ray, then that ray is a descent direction, so we return it. If the inequality holds with equality, it is a flat extreme ray that needs to be checked in the second-order test, so we save it for later use.
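Below is a sketch of such an extreme-ray test for a single hidden node, under the simplifying assumption that the cone lives in the span of the (linearly independent) boundary data vectors and that each candidate ray keeps all but one boundary constraint active; the pseudo-inverse construction, the toy first-order term, and all names are ours, not the exact formulation of FO-Increasing-Test.

```python
import numpy as np

def extreme_rays(X_B):
    """Candidate extreme rays for one hidden node, assuming the rows of X_B
    (boundary data vectors) are linearly independent and the cone lives in
    their span.  Ray i lies in span(X_B) and is orthogonal to every boundary
    vector except the i-th; both signs are candidates, so there are two rays
    per boundary data point."""
    U = np.linalg.pinv(X_B).T        # row i satisfies X_B @ u_i = e_i
    rays = []
    for u in U:
        u = u / np.linalg.norm(u)
        rays.extend([u, -u])
    return rays

def test_extreme_rays(X_B, first_order_term, tol=1e-10):
    """Classify each extreme ray: strict descent (return it), flat (save it
    for the second-order test), or strictly increasing (nothing to do)."""
    flat = []
    for u in extreme_rays(X_B):
        val = first_order_term(u)
        if val < -tol:
            return u, flat           # strict descent direction found
        if abs(val) <= tol:
            flat.append(u)           # flat extreme ray, kept for Step (c)
    return None, flat

# Toy usage with an illustrative, direction-dependent first-order term.
X_B = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
descent, flat = test_extreme_rays(X_B, lambda u: 0.5 * abs(u[0]) + 0.2 * u[1])
print(descent, len(flat))
```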

How many flat extreme rays are there? The presence of flat extreme rays introduces inequality constraints into the QP that we solve in the second-order test. It is ideal to have none, because in that case there are only equality constraints and the QP is easier to solve. Lemma A.2 in Appendix A.2 gives the conditions under which flat extreme rays exist; in short, a flat extreme ray appears only in one of three specific situations identified there. For more details, please refer to Appendix A.2.

3.3 Testing the second-order term (Lines 12–17)

The second-order test checks the sign of the second-order term along the “flat” directions, i.e., those in which the first-order term is zero. This is done with the help of the function SO-Test (Algorithm 5). Given its input, SO-Test fixes the “Jacobian” matrices for all data points, sets up equality/inequality constraints for the boundary data points, and solves a QP of the following form:

(4)
Constraints and number of QPs.

There are equality constraints coming from the nonnegative homogeneity of the activation function: scaling a hidden node's incoming weights and bias by a positive constant while scaling its outgoing weights by the inverse of that constant yields exactly the same network. These equality constraints force the perturbation to be orthogonal to such loss-invariant directions. This observation is stated more formally in Lemma A.3, which as a corollary shows that any differentiable FOSP of the empirical risk always has a rank-deficient Hessian. The remaining constraints make sure that the union of the feasible sets of the QPs is exactly the set of flat directions (please see Lemma A.3 in Appendix A.3 for details). It is also easy to check that these constraints are all linearly independent.
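A quick numerical check of this scaling invariance, as a minimal self-contained sketch (the parameter names and network layout are illustrative, matching the earlier NumPy sketch rather than the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 4, 2
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)
out = W2 @ relu(W1 @ x + b1) + b2

# Scale node j's incoming weights/bias by alpha > 0 and its outgoing column
# by 1/alpha: by positive homogeneity of ReLU the output is unchanged.
j, alpha = 1, 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[j] *= alpha
b1s[j] *= alpha
W2s[:, j] /= alpha
out_scaled = W2s @ relu(W1s @ x + b1s) + b2

print(np.allclose(out, out_scaled))   # True: a loss-invariant direction
```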

If there is no flat extreme ray, the algorithm solves just one QP with equality constraints. If there are flat extreme rays, the algorithm solves one QP with equality constraints, and a few more QPs with both equality and inequality constraints, where the number of additional QPs and inequality constraints is determined by the flat extreme rays:

(5)

Recall from Section 3.2 that a node has a flat extreme ray only in the specific situations identified there; thus, the number of inequality constraints is small whenever the flat extreme rays are few. Please refer to Appendix A.3 for more details.

Efficiency of solving the QPs (4).

Despite the NP-hardness of general QPs, the specific form of our QPs (4) can be solved quite efficiently, avoiding exponential complexity in the number of parameters. After solving QP (4), there are three (disjoint) termination conditions:

  (a) the quadratic objective is strictly positive for every nonzero feasible direction, or

  (b) the quadratic objective is nonnegative for every feasible direction, but zero for some nonzero feasible direction, or

  (c) the quadratic objective is strictly negative for some feasible direction,

where feasibility refers to the constraint set of the QP. With the following two lemmas, we show that these termination conditions can be tested efficiently for ECQPs and ICQPs. First, the ECQPs can be solved iteratively with projected gradient descent: consider a QP that minimizes a quadratic form, given by a symmetric matrix, over the null space of a full-row-rank constraint matrix. Then projected gradient descent (PGD), with a suitable learning rate, either converges to a solution or diverges to infinity exponentially fast. Moreover, with random initialization, PGD correctly distinguishes conditions (a)–(c) with probability 1. The proof is an extension of the unconstrained case [13] and is deferred to Appendix B.3. Note that the projection matrix is computed once at the beginning, and each subsequent update is cheap. It is also somewhat surprising that the convergence rate does not degrade with the size of the problem.
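A minimal sketch of this projected-gradient test for the equality-constrained case, assuming the QP has the homogeneous form “minimize x^T H x subject to A x = 0” (our reading of the lemma); the step-size rule, the norm-based stopping heuristic for case (b), and all names are assumptions of ours.

```python
import numpy as np

def pgd_ecqp_test(H, A, steps=10_000, tol=1e-8, big=1e8, seed=0):
    """Projected gradient descent on f(x) = 0.5 * x^T H x over {x : A x = 0}.

    If H restricted to null(A) has a negative eigenvalue, the iterates blow up
    and the normalized iterate is a descent direction -> condition (c).
    If H is positive definite on null(A), the iterates shrink to 0 -> (a).
    Otherwise the iterates stall at a direction with x^T H x ~ 0 -> (b).
    """
    rng = np.random.default_rng(seed)
    p = H.shape[0]
    # Projection onto null(A); computed once (A assumed to have full row rank).
    P = np.eye(p) - A.T @ np.linalg.solve(A @ A.T, A)
    lr = 0.9 / max(np.linalg.norm(H, 2), 1.0)     # conservative step size
    x = P @ rng.normal(size=p)
    for _ in range(steps):
        x = P @ (x - lr * (H @ x))
        n = np.linalg.norm(x)
        if n > big:
            return "c", x / n                     # negative curvature direction
        if n < tol:
            return "a", None                      # strictly positive on null(A)
    return "b", None                              # flat direction remains

# Toy example: H has a negative eigenvalue, but the constraint removes that
# direction from the feasible set, so the test reports condition (a).
H = np.diag([1.0, -0.5, 2.0])
A = np.array([[0.0, 1.0, 0.0]])                   # null(A) = span{e1, e3}
print(pgd_ecqp_test(H, A))
```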

In the presence of flat extreme rays, we have to solve QPs with inequality constraints. We prove that our ICQP can be solved in time that is exponential only in the number of inequality constraints, which implies that as long as the number of flat extreme rays is small, the problem can still be solved in time polynomial in the number of parameters. Consider a QP of the same form, with additional inequality constraints given by a full-row-rank matrix. Then there exists a method that checks conditions (a)–(c) in time exponential only in the number of inequality constraints. In short, we transform the variables to define an equivalent problem and use classical results on copositive matrices [14, 22, 9]; the problem can be solved by computing the eigensystem of one matrix and testing copositivity of a small matrix. The proof is presented in Appendix B.4.

Concluding the test.

During all calls to SO-Test, whenever any QP terminates with condition (c), SOSP-Check immediately returns the corresponding direction and terminates. After solving all QPs, if any SO-Test call finished with condition (b), then we conclude SOSP-Check with “SOSP.” If all QPs terminated with condition (a), then we return “Local Minimum.”

4 Experiments

For the experiments, we used artificial datasets sampled i.i.d. from the standard normal distribution and trained one-hidden-layer ReLU networks with squared error loss. In practice, it is impossible to land exactly on a nondifferentiable point, because such points lie in a set of measure zero. To get close to them, we ran Adam [11] using full-batch (exact) gradients for 200,000 iterations with a step size that was decayed every 20,000 iterations. We observed that decaying the step size had the effect of “descending deeper into the valley.”

After running Adam, for each hidden node we counted the number of approximate boundary data points, i.e., data points whose pre-activation at that node is within a small tolerance of zero; this gives an estimate of the number of boundary data points. Moreover, for these points, we solved QP (2) using L-BFGS-B [5] to check whether the terminated points are indeed (approximate) FOSPs. We could see that the optimal values of (2) are close to zero (typically extremely small, and only slightly larger for the largest problems). After solving (2), we counted how many of the boundary data points ended up at one of the two extreme values identified in our analysis; the number of such points gives an estimate of the number of flat extreme rays. We also counted the number of approximate boundary data points satisfying an additional, stricter condition, to estimate a related quantity.
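A sketch of the approximate boundary-point counting described above; the tolerance value and the relative scaling by the input norm are assumptions on our part, not the exact criterion used in the experiments.

```python
import numpy as np

def count_approx_boundary(W1, b1, X, tol=1e-4):
    """Count, for each hidden node, the data points whose pre-activation is
    approximately zero (relative to the input norm) -- an estimate of the
    number of boundary data points after training."""
    pre_act = W1 @ X + b1[:, None]                       # (hidden_width, m)
    scale = np.maximum(np.linalg.norm(X, axis=0), 1.0)   # per-point scaling
    near_zero = np.abs(pre_act) <= tol * scale
    return near_zero.sum(axis=1)                         # count per hidden node

# e.g. count_approx_boundary(W1, b1, X).sum() estimates the total M.
```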

We ran the above experiments for several settings of the network and dataset sizes, 40 times each. One of the problem dimensions was held fixed for simplicity. For large networks, the optimizer converged to near-zero minima, making the quantities we track uniformly small, so it was difficult to obtain accurate estimates of the boundary-point and flat-extreme-ray counts. Thus, we had to perform experiments in settings where the optimizer converged to minima that are far from zero.

# Runs Sum (Avg.) Sum (Avg.) Sum (Avg.)
40 290 (7.25) 0 (0) 0 (0) 0
40 371 (9.275) 1 (0.025) 0 (0) 0.025
40 1,452 (36.3) 0 (0) 0 (0) 0
40 2,976 (74.4) 2 (0.05) 0 (0) 0.05
40 24,805 (620.125) 4 (0.1) 0 (0) 0.1
40 14,194 (354.85) 0 (0) 0 (0) 0
40 42,334 (1,058.35) 37 (0.925) 1 (0.025) 0.625
Table 1: Summary of experimental results

Table 1 summarizes the results. Over the 280 runs, we observed that there are surprisingly many boundary data points in general, but usually zero or very few (the maximum was 3) flat extreme rays. This observation carries two important messages: (1) many local minima lie at nondifferentiable points, which is why our analysis is meaningful; (2) luckily, the number of flat extreme rays is usually very small, so we only need to solve ECQPs, or ICQPs with a very small number of inequality constraints, which can be solved efficiently (the two lemmas in Section 3.3). We also observe that the counts do increase as the model dimensions and the training set get larger, but the rate of increase is much slower than the growth of the problem size.

For our experiments, we used artificial regression datasets instead of real ones. This is because popular real datasets are classification datasets, for which the cross-entropy loss can be made arbitrarily small as the parameters tend to infinity, and the hinge loss is not differentiable; neither loss is adequate for our purposes.

5 Discussion and Future Work

We provided an algorithm to test second-order stationarity and escape saddle points at nondifferentiable points of the empirical risk of shallow ReLU-like networks. Despite the difficulty raised by boundary data points dividing the parameter space into exponentially many regions, we reduced the computation to convex QPs, equality/inequality tests, and one (or a few more) nonconvex QPs. In benign cases, the last QP is equality constrained and can be efficiently solved with projected gradient descent. In the worst case, the QP has a few inequality constraints, but it can be solved efficiently when their number is small. We also provided empirical evidence that the number of flat extreme rays is usually either zero or very small, suggesting that the test can be done efficiently in most cases. Extending this test to deeper neural networks is a possible direction for future work. A numerical implementation of this algorithm and its combination with practical gradient-based methods would also be of great interest.

References

Appendix A Full algorithms and proof of correctness

In this section, we present the detailed operation of SOSP-Check (Algorithm 2) and its helper functions FO-Subdiff-Zero-Test, FO-Increasing-Test, and SO-Test (Algorithms 3–5).

In the subsequent subsections, we provide a more detailed proof of the correctness of Algorithm 2. Recall that, by Lemmas 2.2 and 3, the data point vectors associated with the boundary data points are linearly independent, and that the empirical risk admits the expansion of the form (1), in which the direction-dependent “gradient” and “Hessian” are piecewise constant in the perturbation.

A.1 Testing first-order stationarity (Lines 2–4, 7–11, and 16–18)

A.1.1 Test of first-order stationarity for the output-layer parameters (Lines 2–4)

Lines 2–4 of Algorithm 2 correspond to testing whether the gradients with respect to the output-layer parameters are zero; the risk is differentiable in these parameters, so their Clarke subdifferentials are singletons. If they are not all zero, the opposite direction is a descent direction, as Line 3 returns. To see why, suppose one of these gradients is nonzero, and choose a perturbation that moves only the corresponding parameters in the direction opposite to that gradient, leaving all other parameters unchanged. Applying a small multiple of this perturbation does not change the hidden-layer activations, so the resulting change in the risk is, to first order, a strictly negative inner product between the gradient and the perturbation. By scaling the perturbation sufficiently small, we therefore obtain a strict decrease of the risk, which disproves that the given point is a local minimum.

A.1.2 Test of first-order stationarity for the hidden-layer parameters (Lines 7–11 and 16–18)

The test for the hidden-layer parameters is more difficult because, when there are boundary data points, the subdifferential depends on the direction of the perturbation. Recall that each hidden node corresponds to one row of the first-layer weight matrix. Then note from Lemma 3 that