A Generic Approach for Escaping Saddle Points

09/05/2017
by Sashank J. Reddi, et al.
MIT
Inria
Carnegie Mellon University

A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state-of-the-art. Empirical results suggest that our strategy also enjoys good practical performance.

1 Introduction

We study nonconvex finite-sum problems of the form

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where neither $f$ nor the individual functions $f_i$ ($i \in [n]$) are necessarily convex. We operate in a general nonconvex setting except for a few smoothness assumptions, such as Lipschitz continuity of the gradient and Hessian. Optimization problems of this form arise naturally in machine learning and statistics as empirical risk minimization (ERM) and M-estimation, respectively.

In large-scale settings, algorithms based on first-order information of the functions $f_i$ are typically favored as they are relatively inexpensive and scale seamlessly. An algorithm widely used in practice is stochastic gradient descent (Sgd), which has the iterative update:

$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t), \qquad (2)$$

where $i_t \in [n]$ is a randomly chosen index and $\eta_t$ is the learning rate. Under suitable selection of the learning rate, we can show that Sgd converges to a point that, in expectation, satisfies the stationarity condition $\mathbb{E}\|\nabla f(x)\|^2 \le \epsilon$ in $O(1/\epsilon^2)$ iterations Ghadimi and Lan (2013). This result has two critical weaknesses: (i) it does not ensure convergence to local optima or second-order critical points; (ii) the rate of convergence of the Sgd algorithm is slow.
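
For concreteness, here is a minimal sketch of update (2) in NumPy; the objective interface and constant step size are illustrative placeholders, not the paper's experimental setup.

```python
import numpy as np

def sgd(grad_fi, x0, n, eta=0.01, T=1000, seed=0):
    """Minimal SGD sketch for problem (1): at each step, follow the gradient
    of a single randomly chosen component f_i, as in update (2)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(T):
        i = rng.integers(n)           # uniformly random index i_t
        x -= eta * grad_fi(i, x)      # x_{t+1} = x_t - eta_t * grad f_{i_t}(x_t)
    return x
```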

For general nonconvex problems, one has to settle for a more modest goal than global optimality, since finding the global minimizer of a finite-sum nonconvex problem is in general intractably hard. Unfortunately, Sgd does not even ensure second-order critical conditions such as local optimality, since it can get stuck at saddle points. This issue has recently received considerable attention in the ML community, especially in the context of deep learning Dauphin et al. (2014, 2015); Choromanska et al. (2014). These works argue that saddle points are highly prevalent in most optimization paths, and are the primary obstacle for training large deep networks. To tackle this issue and achieve a second-order critical point, for which $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$, we need algorithms that either use the Hessian explicitly or exploit its structure.

A key work that explicitly uses Hessians to obtain faster convergence rates is the cubic regularization (CR) method Nesterov and Polyak (2006). In particular, Nesterov and Polyak (2006) showed that CR requires $O(1/\epsilon^{3/2})$ iterations to achieve the second-order critical conditions. However, each iteration of CR is expensive, as it requires computing the Hessian and solving multiple linear systems, each of which has complexity $O(d^{\omega})$ ($\omega$ is the matrix multiplication constant), thus undermining the benefit of its faster convergence. Recently, Agarwal et al. (2016a) designed an algorithm to solve the CR subproblem more efficiently; however, it still exhibits slower convergence in practice compared to first-order methods. Both of these approaches use Hessian-based optimization in each iteration, which makes them slow in practice.

A second line of work focuses on using Hessian information (or its structure) whenever the method gets stuck at stationary points that are not second-order critical. To our knowledge, the first work in this line is Ge et al. (2015), which shows that for a class of functions satisfying a special "strict-saddle" property, a noisy variant of Sgd can converge to a point close to a local minimum. For this class of functions, points close to saddle points have a Hessian with a large negative eigenvalue, which proves instrumental in escaping saddle points using isotropic noise. While such a noise-based method is appealing as it only uses first-order information, it has a very bad dependence on the dimension $d$, and furthermore, the result only holds when the strict-saddle property is satisfied Ge et al. (2015). More recently, Carmon et al. (2016) presented a new faster algorithm that alternates between first-order and second-order subroutines. However, their algorithm is designed for the simple case of $n = 1$ in (1) and hence, can be expensive in practice.


Figure 1: First-order methods like GD can potentially get stuck at saddle points. Second-order methods can escape them in very few iterations (as observed in the left plot) but at the cost of expensive Hessian-based iterations (see the time plot to the right). The proposed framework, which is a novel mix of the two strategies, can escape saddle points faster in time by carefully trading off computation and iteration complexity.

Inspired by this line of work, we develop a general framework for finding second-order critical points. The key idea of our framework is to use first-order information for most of the optimization process and to invoke Hessian information only when stuck at stationary points that are not second-order critical. We summarize the key idea and main contributions of this paper below.

Main Contributions: We develop an algorithmic framework for converging to second-order critical points and provide convergence analysis for it. Our framework carefully alternates between two subroutines that use gradient and Hessian information, respectively, and ensures second-order criticality. Furthermore, we present two instantiations of our framework and provide convergence rates for them. In particular, we show that a simple instance of our framework, based on Svrg, achieves convergence rates competitive with the current state-of-the-art methods; thus highlighting the simplicity and applicability of our framework. Finally, we demonstrate the empirical performance of a few algorithms encapsulated by our framework and show their superior performance.

Related Work.

There is a vast literature on algorithms for solving optimization problems of the form (1). A classical approach for solving such optimization problems is Sgd, which dates back at least to the seminal work of Robbins and Monro (1951). Since then, Sgd has been a subject of extensive research, especially in the convex setting Poljak and Tsypkin (1973); Ljung (1977); Bottou (1991); Kushner and Clark (2012). Recently, new faster methods, called variance reduced (VR) methods, have been proposed for convex finite-sum problems. VR methods attain faster convergence by reducing the variance in the stochastic updates of Sgd, see e.g., Defazio et al. (2014a); Johnson and Zhang (2013); Schmidt et al. (2013); Konečný et al. (2015); Shalev-Shwartz and Zhang (2013); Defazio et al. (2014b). Accelerated variants of these methods achieve the lower bounds proved in Agarwal and Bottou (2014); Lan and Zhou (2015), thereby settling the question of their optimality. Furthermore, Reddi et al. (2015) developed an asynchronous framework for VR methods and demonstrated their benefits in parallel environments.

Most of the aforementioned prior works study stochastic methods in convex or very specialized nonconvex settings that admit theoretical guarantees on sub-optimality. For the general nonconvex setting, it is only recently that a non-asymptotic convergence rate analysis for Sgd and its variants was obtained by Ghadimi and Lan (2013), who showed that Sgd ensures (in expectation) $\|\nabla f(x)\|^2 \le \epsilon$ in $O(1/\epsilon^2)$ iterations. A similar rate for parallel and distributed Sgd was shown in Lian et al. (2015). For these problems, Reddi et al. (2016a, b, c) proved faster convergence rates that ensure the same optimality criterion in $O(n + n^{2/3}/\epsilon)$ IFO calls, which is an order faster than GD. While these methods ensure convergence to stationary points at a faster rate, the question of convergence to local minima (or, in general, to second-order critical points) has not been addressed. To our knowledge, convergence rates to second-order critical points (defined in Definition 1) for general nonconvex functions were first studied by Nesterov and Polyak (2006). However, each iteration of the algorithm in Nesterov and Polyak (2006) is prohibitively expensive since it requires eigenvalue decompositions, and hence it is unsuitable for large-scale high-dimensional problems. More recently, Carmon et al. (2016); Agarwal et al. (2016a) presented algorithms for finding second-order critical points by tackling some practical issues that arise in Nesterov and Polyak (2006). However, these algorithms are either only applicable to a restricted setting or heavily use Hessian-based computations, making them unappealing from a practical standpoint. Noisy variants of first-order methods have also been shown to escape saddle points (see Ge et al. (2015); Jin et al. (2017); Levy (2016)); however, these methods have a strong dependence on either $n$ or $d$, both of which are undesirable.

2 Background & Problem Setup

We assume that each of the functions $f_i$ in (1) is $L$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$. Furthermore, we assume that the Hessian of $f$ in (1) is Lipschitz, i.e., we have

$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M\|x - y\|, \qquad (3)$$

for all $x, y \in \mathbb{R}^d$. Such a condition is typically necessary to ensure convergence of algorithms to second-order critical points Nesterov and Polyak (2006). In addition to the above smoothness conditions, we also assume that the function $f$ is bounded below, i.e., $f(x) \ge B$ for all $x \in \mathbb{R}^d$.

In order to measure the stationarity of an iterate $x$, similar to Nesterov (2003); Ghadimi and Lan (2013); Nesterov and Polyak (2006), we use the condition $\|\nabla f(x)\| \le \epsilon$. In this paper, we are interested in convergence to second-order critical points. Thus, in addition to stationarity, we also require the solution to satisfy the Hessian condition $\nabla^2 f(x) \succeq -\gamma\,\mathbb{I}$ Nesterov and Polyak (2006). For iterative algorithms, we require both $\epsilon, \gamma \to 0$ as the number of iterations $T \to \infty$. When all saddle points are non-degenerate, such a condition implies convergence to a local optimum.

Definition 1.

An algorithm $\mathcal{A}$ is said to obtain a point $x$ that is an $(\epsilon, \gamma)$-second order critical point if $\mathbb{E}\|\nabla f(x)\| \le \epsilon$ and $\mathbb{E}[\lambda_{\min}(\nabla^2 f(x))] \ge -\gamma$, where the expectation is over any randomness in $\mathcal{A}$.

We must exercise caution while interpreting results pertaining to $(\epsilon, \gamma)$-second order critical points. Such points need not be close to any local minimum, either in objective function value or in the domain of (1). For our algorithms, we use only an Incremental First-order Oracle (IFO) Agarwal and Bottou (2014) and an Incremental Second-order Oracle (ISO), defined below.

Definition 2.

An IFO takes an index $i \in [n]$ and a point $x \in \mathbb{R}^d$, and returns the pair $(f_i(x), \nabla f_i(x))$. An ISO takes an index $i \in [n]$, a point $x \in \mathbb{R}^d$ and a vector $v \in \mathbb{R}^d$, and returns the vector $\nabla^2 f_i(x)\, v$.

IFO and ISO calls are typically cheap, with an ISO call being relatively more expensive. In many practical settings that arise in machine learning, the time complexity of these oracle calls is linear in $d$ Agarwal et al. (2016b); Pearlmutter (1994). For clarity and clean comparison, the dependence of time complexity on the Lipschitz constants $L$ and $M$, the initial point, and any polylog factors is hidden in our results.
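
To make Definition 2 concrete, the following sketch wraps both oracles for a least-squares instance; the instance and class name are our own illustrative choices, not taken from the paper. Note that each call costs $O(d)$ time.

```python
import numpy as np

class FiniteSumOracle:
    """IFO/ISO sketch for f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2.
    Here grad f_i(x) = (a_i^T x - b_i) a_i and hess f_i(x) v = (a_i^T v) a_i,
    so both oracle calls cost O(d)."""

    def __init__(self, A, b):
        self.A, self.b = A, b  # A: n x d data matrix, b: n targets

    def ifo(self, i, x):
        r = self.A[i] @ x - self.b[i]
        return 0.5 * r * r, r * self.A[i]      # (f_i(x), grad f_i(x))

    def iso(self, i, x, v):
        return (self.A[i] @ v) * self.A[i]     # hess f_i(x) @ v
```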

3 Generic Framework

In this section, we propose a generic framework for escaping saddle points while solving nonconvex problems of form (1). One of the primary difficulties in reaching a second-order critical point is the presence of saddle points. To evade such points, one needs to use properties of both gradients and Hessians. To this end, our framework is based on two core subroutines: Gradient-Focused-Optimizer and Hessian-Focused-Optimizer.

The idea is to use these two subroutines, each focused on a different aspect of the optimization procedure. Gradient-Focused-Optimizer focuses on using gradient information for decreasing the function. On its own, the Gradient-Focused-Optimizer might not converge to a local minimizer, since it can get stuck at a saddle point. Hence, we require the subroutine Hessian-Focused-Optimizer to help avoid saddle points. A natural idea is to interleave these subroutines to obtain a second-order critical point, but it is not clear a priori whether such a procedure even converges. We propose a carefully designed procedure that effectively balances these two subroutines, which not only provides meaningful theoretical guarantees, but remarkably also translates into strong empirical gains in practice.

Algorithm 1 provides the pseudocode of our framework. Observe that the algorithm is still abstract, since it does not specify the subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer. These subroutines determine the crucial update mechanism of the algorithm. We will present specific instances of these subroutines in the next section, but we assume the following properties to hold for them.

1:  Input - initial point $x^0$, total iterations $T$, error threshold parameters $\epsilon$, $\gamma$, and probability $p$
2:  for $t = 1$ to $T$ do
3:     $(y^t, z^t)$ = Gradient-Focused-Optimizer$(x^{t-1}, \epsilon)$ (refer to G.1 and G.2)
4:     Choose $u^t$ as $y^t$ with probability $p$ and as $z^t$ with probability $1 - p$
5:     $x^t$ = Hessian-Focused-Optimizer$(u^t, \epsilon, \gamma)$ (refer to H.1 and H.2)
6:     if $x^t = u^t$ then
7:        Output set $\{u^t\}$
8:     end if
9:  end for
10:  Output set $\{u^1, \dots, u^T\}$
Algorithm 1 Generic Framework
  • Gradient-Focused-Optimizer: Suppose $(y, z)$ = Gradient-Focused-Optimizer$(x, \epsilon)$. Then there exists a positive function $g$ such that:

    (G.1) $\mathbb{E}[f(y)] \le f(x)$ and $\mathbb{E}[f(z)] \le f(x)$,

    (G.2) $\mathbb{E}[f(y)] \le \mathbb{E}[f(z)] - g(\epsilon)\,\mathbb{E}\|\nabla f(z)\|^2$.

    Here the outputs $y, z \in \mathbb{R}^d$. The expectation in the conditions above is over any randomness that is a part of the subroutine. The function $g$ will be critical for the overall rate of Algorithm 1. Typically, Gradient-Focused-Optimizer is a first-order method, since the primary aim of this subroutine is to focus on gradient-based optimization.

  • Hessian-Focused-Optimizer: Suppose $z$ = Hessian-Focused-Optimizer$(y, \epsilon, \gamma)$, where $y, z \in \mathbb{R}^d$. If $z = y$, then $y$ is an $(\epsilon, \gamma)$-second order critical point with probability at least $1 - q$. Otherwise, if $z \ne y$, then $z$ satisfies the following conditions:

    (H.1) $\mathbb{E}[f(z)] \le f(y)$,

    (H.2) $\mathbb{E}[f(z)] \le \mathbb{E}[f(y)] - h(\epsilon, \gamma)$ when $\lambda_{\min}(\nabla^2 f(y)) \le -\gamma$, for some positive function $h$.

    Here the expectation is over any randomness in the subroutine Hessian-Focused-Optimizer. The two conditions ensure that the objective function value, in expectation, never increases and, furthermore, decreases at a certain rate when $\lambda_{\min}(\nabla^2 f(y)) \le -\gamma$. In general, this subroutine utilizes the Hessian or its properties for minimizing the objective function. Typically, this is the most expensive part of Algorithm 1 and hence needs to be invoked judiciously.

The key aspect of these subroutines is that they, in expectation, never increase the objective function value. The functions $g$ and $h$ will determine the convergence rate of Algorithm 1. In order to provide a concrete implementation, we need to specify the aforementioned subroutines. Before we delve into those details, we provide a generic convergence analysis for Algorithm 1.
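
Putting the pieces together, here is a minimal sketch of Algorithm 1 under the assumptions above; the callable interface and the equality-based termination test are our rendering of the pseudocode, with all names our own.

```python
import numpy as np

def generic_framework(x0, gfo, hfo, T, p, seed=0):
    """Sketch of Algorithm 1. gfo(x) -> (y, z) is a Gradient-Focused-Optimizer;
    hfo(u) -> x is a Hessian-Focused-Optimizer that returns u itself when u is
    already judged second-order critical."""
    rng = np.random.default_rng(seed)
    x, outputs = x0, []
    for t in range(T):
        y, z = gfo(x)                        # gradient-focused phase
        u = y if rng.random() < p else z     # u^t = y^t w.p. p, else z^t
        x = hfo(u)                           # Hessian-focused phase
        if np.array_equal(x, u):             # hfo made no move: u is critical
            return [u]
        outputs.append(u)
    return outputs
```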

Convergence Analysis

Theorem 1.

Let $\Delta = f(x^0) - B$ and $\theta = \min\{p\,g(\epsilon)\,\epsilon^2,\; (1-p)\,h(\epsilon, \gamma)\}$. Also, let the set $S$ be the output of Algorithm 1 with Gradient-Focused-Optimizer satisfying G.1 and G.2 and Hessian-Focused-Optimizer satisfying H.1 and H.2. Furthermore, let $T$ be such that $T \ge \Delta/(\zeta\theta)$ for some $\zeta \in (0, 1)$.

Suppose the multiset $S' = \{s_1, \dots, s_k\}$ contains indices selected independently and uniformly randomly from $\{1, \dots, T\}$. Then the following holds for the indices in $S'$:

  1. Each $u^{s_j}$, where $s_j \in S'$, is an $(\epsilon, \gamma)$-critical point with probability at least $1 - \zeta$.

  2. If $k \ge \log(1/\delta)/\log(1/\zeta)$, then with probability at least $1 - \delta$, at least one iterate $u^{s_j}$, where $s_j \in S'$, is an $(\epsilon, \gamma)$-critical point.

The proof of the result is presented in Appendix A. The key point regarding the above result is that the overall convergence rate depends on the magnitude of both functions $g$ and $h$. Theorem 1 shows that the slower of the subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer governs the overall rate of Algorithm 1. Thus, it is important to ensure that both these procedures have good convergence. Also, note that the optimal setting for $p$ based on the result above satisfies $p\,g(\epsilon)\,\epsilon^2 = (1-p)\,h(\epsilon, \gamma)$. We defer further discussion of convergence to the next section, where we present more specific convergence and rate analysis.

4 Concrete Instantiations

We now present specific instantiations of our framework. Before stating our key results, we discuss an important subroutine that is used as Gradient-Focused-Optimizer for the rest of this paper: Svrg. We give a brief description of the algorithm and show that it meets the conditions required of a Gradient-Focused-Optimizer. Svrg Johnson and Zhang (2013); Reddi et al. (2016a) is a stochastic algorithm recently shown to be very effective for reducing variance in finite-sum problems. We seek to understand its benefits for nonconvex optimization, with a particular focus on the issue of escaping saddle points. Algorithm 2 presents Svrg's pseudocode.

1:  Input: $\tilde{x}^0 = x^0 \in \mathbb{R}^d$, epoch length $m$, step sizes $\{\eta_t > 0\}_{t=0}^{m-1}$, number of epochs $S$
2:  for $s = 0$ to $S - 1$ do
3:     $x_0^{s+1} = \tilde{x}^s = x_m^s$
4:     $g^{s+1} = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x}^s)$
5:     for $t = 0$ to $m - 1$ do
6:        Uniformly randomly pick $i_t$ from $\{1, \dots, n\}$
7:        $v_t^{s+1} = \nabla f_{i_t}(x_t^{s+1}) - \nabla f_{i_t}(\tilde{x}^s) + g^{s+1}$
8:        $x_{t+1}^{s+1} = x_t^{s+1} - \eta_t\, v_t^{s+1}$
9:     end for
10:  end for
11:  Output: $(y, z)$ where $y = x_m^S$ and $z$ is an iterate chosen uniformly at random from $\{x_t^{s+1}\}_{t=0,\, s=0}^{m-1,\, S-1}$.
Algorithm 2 SVRG

Observe that Algorithm 2 is an epoch-based algorithm. At the start of each epoch $s$, a full gradient is calculated at the snapshot point $\tilde{x}^s$, requiring $n$ calls to the IFO. Within its inner loop, Svrg performs $m$ stochastic updates. Suppose $m$ is chosen to be $O(n)$ (as is typical in practice); then the total number of IFO calls per epoch is $O(n)$. Strong convergence rates have been proved for Algorithm 2 in the context of convex and nonconvex optimization Johnson and Zhang (2013); Reddi et al. (2016a). The following result shows that Svrg meets the requirements of a Gradient-Focused-Optimizer.
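
A compact sketch of Algorithm 2 follows; the default epoch length and step size below are illustrative, not the constants of Lemma 1.

```python
import numpy as np

def svrg(grad_fi, x0, n, m=None, eta=0.01, epochs=10, seed=0):
    """SVRG sketch: per epoch, one full gradient at the snapshot plus m
    variance-reduced stochastic steps. Returns (y, z): the final iterate and
    a uniformly sampled intermediate iterate."""
    rng = np.random.default_rng(seed)
    m = m or n
    x = x0.copy()
    iterates = []
    for s in range(epochs):
        snap = x.copy()
        g_full = np.mean([grad_fi(i, snap) for i in range(n)], axis=0)
        for t in range(m):
            i = rng.integers(n)
            # variance-reduced gradient: unbiased, with shrinking variance
            v = grad_fi(i, x) - grad_fi(i, snap) + g_full
            x = x - eta * v
            iterates.append(x)
    z = iterates[rng.integers(len(iterates))]   # uniformly random iterate
    return x, z
```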

Lemma 1.

Suppose $\eta_t = \eta = c/(Ln^{2/3})$ for a suitably small constant $c$, $m = \lceil n \rceil$, and the total number of inner iterations is $T_s$, which depends on $n$; then Algorithm 2 is a Gradient-Focused-Optimizer with $g(\epsilon) = \Theta(T_s/(Ln^{2/3}))$.

In the rest of this section, we discuss approaches that use Svrg as the Gradient-Focused-Optimizer. In particular, we propose and provide convergence analysis for two methods with different Hessian-Focused-Optimizer subroutines, both of which use Svrg as the Gradient-Focused-Optimizer.

4.1 Hessian descent

The first approach is based on directly using the eigenvector corresponding to the smallest eigenvalue of the Hessian as a descent direction in the Hessian-Focused-Optimizer. More specifically, when the smallest eigenvalue of the Hessian is negative and reasonably large in magnitude, the Hessian information can be used to ensure descent in the objective function value. The pseudocode for the algorithm is given in Algorithm 3.

The key idea is to utilize the minimum eigenvalue information in order to make a descent step: if $\lambda_{\min}(\nabla^2 f(x)) \le -\gamma$, a step along the corresponding eigenvector decreases the objective. Note that the subroutine is designed in such a fashion that the objective function value never increases; thus, it naturally satisfies requirement H.1 of a Hessian-Focused-Optimizer. The following result shows that HessianDescent is a Hessian-Focused-Optimizer.

Lemma 2.

HessianDescent is a Hessian-Focused-Optimizer with $h(\epsilon, \gamma) = \Theta(\gamma^3/M^2)$.

The proof of the result is presented in Appendix C. With Svrg as Gradient-Focused-Optimizer and HessianDescent as Hessian-Focused-Optimizer, we show the following key result:

Theorem 2.

Suppose Svrg with $\eta_t = \eta = c/(Ln^{2/3})$ (for a suitably small constant $c$) for all $t$ and $m = \lceil n \rceil$ is used as Gradient-Focused-Optimizer, and HessianDescent is used as Hessian-Focused-Optimizer with $\gamma = \sqrt{\epsilon}$; then Algorithm 1 finds an $(\epsilon, \sqrt{\epsilon})$-second order critical point in $T = \tilde{O}(1/\epsilon^{3/2})$ iterations with probability at least $1 - \delta$, where $\delta$ bounds the total failure probability of the HessianDescent steps.

The result directly follows from using Lemmas 1 and 2 in Theorem 1. The result shows that the iteration complexity of Algorithm 1 in this case is $\tilde{O}(1/\epsilon^{3/2})$. Thus, the overall IFO complexity of the Svrg steps is $\tilde{O}(n/\epsilon^{3/2} + n^{2/3}/\epsilon^2)$. Since each IFO call takes $O(d)$ time, the overall time complexity of all Gradient-Focused-Optimizer steps is $\tilde{O}(nd/\epsilon^{3/2} + n^{2/3}d/\epsilon^2)$. To understand the time complexity of HessianDescent, we need the following result Agarwal et al. (2016a).

1:  Find $v$ such that $\|v\| = 1$, and with probability at least $1 - q$ the following inequality holds: $v^\top \nabla^2 f(x)\, v \le \lambda_{\min}(\nabla^2 f(x)) + \frac{\gamma}{2}$.
2:  Set $\alpha = |v^\top \nabla^2 f(x)\, v| / M$.
3:  $v = -\mathrm{sign}(\langle \nabla f(x), v \rangle)\, v$
4:  $y = x + \alpha v$
5:  Output: $y$.
Algorithm 3 HessianDescent
Proposition 1.

The time complexity of finding $v$ such that $\|v\| = 1$ and, with probability at least $1 - q$, the inequality $v^\top \nabla^2 f(x)\, v \le \lambda_{\min}(\nabla^2 f(x)) + \frac{\gamma}{2}$ holds, is $\tilde{O}(nd + n^{3/4}d/\sqrt{\gamma})$.

Note that each iteration of Algorithm 1 in this case has just linear dependence on $d$. Since the total number of HessianDescent iterations is $\tilde{O}(1/\epsilon^{3/2})$ and each iteration has a complexity of $\tilde{O}(nd + n^{3/4}d/\epsilon^{1/4})$ (Proposition 1 with $\gamma = \sqrt{\epsilon}$), using the above result we obtain an overall time complexity of HessianDescent of $\tilde{O}(nd/\epsilon^{3/2} + n^{3/4}d/\epsilon^{7/4})$. Combining this with the time complexity of Svrg, we get the following result.
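
The sketch below illustrates a HessianDescent-style step using only ISO-type Hessian-vector products. A shifted power iteration stands in here for the fast eigenvector solvers behind Proposition 1, and the iteration budget and interface are our placeholder choices.

```python
import numpy as np

def hessian_descent(x, f, grad, hvp, L, M, iters=100, seed=0):
    """Sketch of a HessianDescent step (in the spirit of Algorithm 3).
    hvp(x, v) ~ hess f(x) @ v, i.e., one ISO-style call per use."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    # Power iteration on (L*I - H): its top eigenvector is H's bottom one,
    # since H's eigenvalues lie in [lambda_min, L].
    for _ in range(iters):
        v = L * v - hvp(x, v)
        v /= np.linalg.norm(v)
    lam = v @ hvp(x, v)                 # Rayleigh quotient ~ lambda_min(H)
    if lam >= 0:
        return x                        # no exploitable negative curvature
    alpha = abs(lam) / M                # step length scaled by curvature / M
    if grad(x) @ v > 0:                 # orient v as a descent direction
        v = -v
    y = x + alpha * v
    return y if f(y) <= f(x) else x     # never increase the objective (H.1)
```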

Corollary 1.

The overall running time of Algorithm 1 to find an $(\epsilon, \sqrt{\epsilon})$-second order critical point, with the parameter settings used in Theorem 2, is $\tilde{O}\big(n^{2/3}d/\epsilon^2 + nd/\epsilon^{3/2} + n^{3/4}d/\epsilon^{7/4}\big)$.

Note that the dependence on $d$ is much better in comparison to that of the noisy Sgd approach of Ge et al. (2015). Furthermore, our results are competitive with Agarwal et al. (2016a); Carmon et al. (2016) in their respective settings, but with a much simpler algorithm and analysis. We also note that our algorithm is faster than the one proposed in Jin et al. (2017), which has a time complexity of $\tilde{O}(nd/\epsilon^2)$.

4.2 Cubic Descent

In this section, we show that the cubic regularization method of Nesterov and Polyak (2006) can be used as a Hessian-Focused-Optimizer. More specifically, here Hessian-Focused-Optimizer approximately solves the following optimization problem:

$$y = \arg\min_{z \in \mathbb{R}^d} \; \langle \nabla f(x), z - x \rangle + \frac{1}{2}\langle z - x, \nabla^2 f(x)\,(z - x) \rangle + \frac{M}{6}\|z - x\|^3, \qquad \text{(CubicDescent)}$$

and returns $y$ as output. The following result can be proved for this approach.

Theorem 3.

Suppose Svrg (with the same parameters as in Theorem 2) is used as Gradient-Focused-Optimizer and CubicDescent is used as Hessian-Focused-Optimizer with $\gamma = \sqrt{\epsilon}$; then Algorithm 1 finds an $(\epsilon, \sqrt{\epsilon})$-second order critical point in $T = \tilde{O}(1/\epsilon^{3/2})$ iterations with probability at least $1 - \delta$.

In principle, Algorithm 1 with CubicDescent as Hessian-Focused-Optimizer can converge without the use of the Gradient-Focused-Optimizer subroutine at each iteration, since it essentially reduces to the cubic regularization method of Nesterov and Polyak (2006). In practice, however, we would expect Gradient-Focused-Optimizer to perform most of the optimization and Hessian-Focused-Optimizer to be used for far fewer iterations. Using the method developed in Nesterov and Polyak (2006) for solving CubicDescent, we obtain the following corollary.

Corollary 2.

The overall running time of Algorithm 1 to find an $(\epsilon, \sqrt{\epsilon})$-second order critical point, with the parameter settings used in Theorem 3, is $\tilde{O}\big((nd + d^{\omega})/\epsilon^{3/2}\big)$.

Here $\omega$ is the matrix multiplication constant. The dependence on $\epsilon$ is weaker than in Corollary 1. However, each iteration of CubicDescent is expensive (as seen from the $d^{\omega}$ factor in the corollary above), and thus, in the high-dimensional settings typically encountered in machine learning, this approach can be more expensive than HessianDescent.
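
As done in the experiments of Section 5, the cubic subproblem can itself be minimized approximately by gradient descent on the cubic model. A sketch follows; the step size, iteration budget, and tolerance are our placeholder choices.

```python
import numpy as np

def cubic_descent_step(x, grad, hvp, M, lr=0.01, iters=200, tol=1e-6):
    """Sketch: minimize the cubic model over the step s
        m(s) = <grad f(x), s> + 0.5 <s, hess f(x) s> + (M/6) ||s||^3
    by gradient descent, then return x + s."""
    g = grad(x)
    s = np.zeros_like(x)
    for _ in range(iters):
        # grad of m at s: g + H s + (M/2) ||s|| s
        ms = g + hvp(x, s) + 0.5 * M * np.linalg.norm(s) * s
        if np.linalg.norm(ms) < tol:      # subproblem gradient-norm stopping rule
            break
        s -= lr * ms
    return x + s
```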

4.3 Practical Considerations

The focus of this section was to demonstrate the wide applicability of our framework: using simple instantiations of it, we obtain algorithms with fast convergence rates. To further achieve good empirical performance, we had to slightly modify these procedures. For Hessian-Focused-Optimizer, we found that stochastic, adaptive and inexact approaches for solving HessianDescent and CubicDescent work well in practice. Due to lack of space, the exact description of these modifications is deferred to Appendix F. Furthermore, in the context of deep learning, empirical evidence suggests that first-order methods like Adam Kingma and Ba (2014) exhibit behavior that is in congruence with properties G.1 and G.2. While a theoretical analysis of the setting where Adam is used as Gradient-Focused-Optimizer remains open, we nevertheless demonstrate its performance through empirical results in the following section.

5 Experiments

We now present empirical results for our saddle point avoidance technique, with the aim of highlighting three aspects: (i) the framework successfully escapes non-degenerate saddle points, (ii) the framework is fast, and (iii) the framework is practical on large-scale problems. All algorithms are implemented in TensorFlow Abadi et al. (2015). In the case of deep networks, the Hessian-vector product is evaluated using the trick presented in Pearlmutter (1994). We run our experiments on a commodity machine with an Intel Xeon E5-2630 v4 CPU, 256GB RAM, and an NVidia Titan X (Pascal) GPU.
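
The Pearlmutter (1994) trick computes $\nabla^2 f(x)\,v$ exactly with a second pass of automatic differentiation. When autodiff is unavailable, a central finite difference of gradients along $v$ is a common surrogate with the same $O(d)$ cost per call; a sketch (the step size below is a heuristic choice, not from the paper):

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-5):
    """Approximate hess f(x) @ v with two gradient evaluations:
        (grad f(x + eps*v) - grad f(x - eps*v)) / (2*eps).
    A cheap stand-in for the exact autodiff (Pearlmutter) Hessian-vector
    product; cost is O(d), like an ISO call."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)
```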

Figure 2: Comparison of various methods on a synthetic problem. Our mix framework successfully escapes the saddle point and uses relatively few ISO calls in comparison to CubicDescent.

Synthetic Problem. To demonstrate the fast escape from a saddle point by the proposed method, we consider the following simple nonconvex finite-sum problem:

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{2}\,x^\top A\, x + b_i^\top x\Big). \qquad (4)$$

Here the parameters are designed such that $\sum_i b_i = 0$ and the matrix $A$ has exactly one negative eigenvalue (small in magnitude), with the remaining eigenvalues randomly chosen from a positive interval. The total number of examples $n$ is set to 100,000. It is not hard to see that this problem has a non-degenerate saddle point at the origin. This allows us to explore the behavior of different optimization algorithms in the vicinity of the saddle point. In this experiment, we compare a mix of Svrg and HessianDescent (as in Theorem 2) with Sgd (with constant step size), Adam, Svrg and CubicDescent. The parameters of these algorithms are chosen by grid search so as to give the best performance. The subproblem of CubicDescent was solved with gradient descent Carmon et al. (2016) until the gradient norm of the subproblem was reduced below a small threshold. We study the progress of optimization, i.e., the decrease in function value, against wall clock time, IFO calls, and ISO calls. All algorithms were initialized with the same starting point, very close to the origin.
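
A sketch of how such an instance can be generated, following the reconstruction of (4) above; the spectrum, magnitudes, and problem sizes below are our placeholder choices rather than the paper's exact constants.

```python
import numpy as np

def make_saddle_problem(n=1000, d=50, seed=0):
    """Quadratic finite-sum with a non-degenerate saddle at the origin:
    f_i(x) = 0.5 x^T A x + b_i^T x with sum_i b_i = 0, so f(x) = 0.5 x^T A x
    where A has exactly one small negative eigenvalue."""
    rng = np.random.default_rng(seed)
    eigs = rng.uniform(1.0, 2.0, size=d)      # placeholder positive spectrum
    eigs[0] = -1e-3                           # single small negative eigenvalue
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = Q @ np.diag(eigs) @ Q.T               # symmetric with desired spectrum
    b = rng.standard_normal((n, d))
    b -= b.mean(axis=0)                       # enforce sum_i b_i = 0
    grad_fi = lambda i, x: A @ x + b[i]       # grad f_i(x)
    return A, b, grad_fi
```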

The results are presented in Figure 2, which shows that our proposed mix framework was the fastest to escape the saddle point in terms of wall clock time. We observe that the performance of the first-order methods suffered severely due to the saddle point. Note that Sgd eventually escaped the saddle point due to the inherent noise in the mini-batch gradient. CubicDescent, a second-order method, escaped the saddle point in fewer iterations by using Hessian information, but operating on Hessian information is expensive; as a result, this method was slow in terms of wall clock time. The proposed framework, which is a mix of the two strategies, inherits the best of both worlds by using cheap gradient information most of the time and reducing the use of the relatively expensive Hessian information (ISO calls) by 100x. This resulted in a faster escape from the saddle point in terms of wall clock time.

Deep Networks. To investigate the practical performance of the framework on deep learning problems, we applied it to two deep autoencoder optimization problems from Hinton and Salakhutdinov (2006), called "CURVES" and "MNIST". Due to their high difficulty, performance on these problems has become a standard benchmark for neural network optimization methods, e.g., Martens and Grosse (2015); Sutskever et al. (2013); Vinyals and Povey (2012); Martens (2010). The "CURVES" autoencoder consists of an encoder with layers of size (28x28)-400-200-100-50-25-6 and a symmetric decoder, totaling 0.85M parameters. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The data set contains images of curves generated from three randomly chosen points in two dimensions. The "MNIST" autoencoder consists of an encoder with layers of size (28x28)-1000-500-250-30 and a symmetric decoder, totaling 2.8M parameters. The thirty units in the code layer were linear and all the other units were logistic. The network was trained on 60,000 images and tested on 10,000 new images. The data set contains images of handwritten digits 0-9. The pixel intensities were normalized to lie between 0 and 1. (Data available at: www.cs.toronto.edu/~jmartens/digs3pts_1.mat, mnist_all.mat)

Figure 3: Comparison of various methods on CURVES and MNIST Deep Autoencoder. Our mix approach converges faster than the baseline methods and uses relatively few ISO calls in comparison to ApproxCubicDescent.

As an instantiation of our framework, we use a mix of Adam, which is popular in the deep learning community, and ApproxCubicDescent, for the practical reasons mentioned in Section 4.3. We compare this method with Adam and ApproxCubicDescent alone. The parameters of these algorithms were chosen to produce the best generalization on a held-out test set. The regularization parameter was chosen as the smallest value such that the function value does not fluctuate in the first 10 epochs. We use the initialization suggested in Martens (2010) and a mini-batch size of 1000 for all algorithms. We report the objective function value against wall clock time and ISO calls.

The results are presented in Figure 3, which shows that our proposed mix framework was the fastest to escape the saddle point in terms of wall clock time. Adam took considerably more time to escape the saddle point, especially in the case of MNIST. While ApproxCubicDescent escaped the saddle point in relatively fewer iterations, each iteration required a considerably larger number of ISO calls; as a result, the method was extremely slow in terms of wall clock time, despite our efforts to improve it via approximations and code optimizations. Our proposed framework, on the other hand, seamlessly balances these two methods, thereby resulting in a fast decrease of the training loss.

6 Discussion

In this paper, we examined a generic strategy to escape saddle points in nonconvex finite-sum problems and presented its convergence analysis. The key intuition is to alternate between a first-order and a second-order optimizer; the latter is mainly intended to escape points that are only stationary but are not second-order critical points. We presented two different instantiations of our framework and provided their detailed convergence analysis. While both our methods explicitly use Hessian information, one can also use noisy first-order methods as the Hessian-Focused-Optimizer (see, e.g., noisy Sgd in Ge et al. (2015)). In such a scenario, the negative eigenvalues of the Hessian are exploited to escape saddle points by means of isotropic noise, and the ISO is not used explicitly. For these methods, under the strict-saddle property Ge et al. (2015), we can show convergence to local optima within our framework.

We primarily focused on obtaining second-order critical points for nonconvex finite-sums (1). This does not necessarily imply low test error or good generalization capabilities. Thus, we should be careful when interpreting the results presented in this paper. A detailed discussion or analysis of these issues is beyond the scope of this paper. While a few prior works argue for convergence to local optima, the exact connection between generalization and local optima is not well understood and is an interesting open problem. Nevertheless, we believe the techniques presented in this paper can be used alongside other optimization tools for faster and better nonconvex optimization.

References

Appendix A Proof of Theorem 1

The case $x^t = u^t$ can be handled in a straightforward manner, so let us focus on the case where $x^t \ne u^t$ for all $t \in [T]$. We split our analysis into cases, each analyzing the change in objective function value depending on the second-order criticality of the iterates.

We start with the case where the gradient condition of a second-order critical point is violated, and then proceed to the case where the Hessian condition is violated.

Case I: $\mathbb{E}\|\nabla f(z^t)\| \ge \epsilon$ for some $t \in [T]$

We first observe the following: $\mathbb{E}\|\nabla f(z^t)\|^2 \ge \big(\mathbb{E}\|\nabla f(z^t)\|\big)^2 \ge \epsilon^2$. This follows from a straightforward application of Jensen's inequality. From this inequality, we have the following:

$$\mathbb{E}[f(y^t)] \le \mathbb{E}[f(z^t)] - g(\epsilon)\,\epsilon^2. \qquad (5)$$

This follows from the fact that $(y^t, z^t)$ is the output of the Gradient-Focused-Optimizer subroutine, which satisfies the condition that for $(y, z)$ = Gradient-Focused-Optimizer$(x, \epsilon)$, we have

$$\mathbb{E}[f(y)] \le \mathbb{E}[f(z)] - g(\epsilon)\,\mathbb{E}\|\nabla f(z)\|^2.$$

From Equation (5), we have

$$\mathbb{E}[f(u^t)] = p\,\mathbb{E}[f(y^t)] + (1-p)\,\mathbb{E}[f(z^t)] \le \mathbb{E}[f(z^t)] - p\,g(\epsilon)\,\epsilon^2.$$

Furthermore, due to the non-increasing nature of Gradient-Focused-Optimizer, we also have $\mathbb{E}[f(z^t)] \le \mathbb{E}[f(x^{t-1})]$.

We now focus on the Hessian-Focused-Optimizer subroutine. From the property of Hessian-Focused-Optimizer that the objective function value is non-increasing, we have $\mathbb{E}[f(x^t)] \le \mathbb{E}[f(u^t)]$. Therefore, combining with the above inequalities, we have

$$\mathbb{E}[f(x^t)] \le \mathbb{E}[f(u^t)] = p\,\mathbb{E}[f(y^t)] + (1-p)\,\mathbb{E}[f(z^t)] \le \mathbb{E}[f(x^{t-1})] - p\,g(\epsilon)\,\epsilon^2. \qquad (6)$$

The first equality is due to the definition of $u^t$ in Algorithm 1. Therefore, when the gradient condition is violated, irrespective of whether $u^t = y^t$ or $u^t = z^t$, the objective function value always decreases by at least $p\,g(\epsilon)\,\epsilon^2$.

Case II: $\mathbb{E}\|\nabla f(z^t)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(z^t)) \le -\gamma$ for some $t \in [T]$

In this case, we first note that for $(y^t, z^t)$ = Gradient-Focused-Optimizer$(x^{t-1}, \epsilon)$ and $u^t$ chosen as in Algorithm 1, we have $\mathbb{E}[f(u^t)] \le \mathbb{E}[f(x^{t-1})]$. Observe that $\mathbb{E}[f(u^t)] = p\,\mathbb{E}[f(y^t)] + (1-p)\,\mathbb{E}[f(z^t)]$. Therefore, if $u^t = y^t$, then we have

$$\mathbb{E}[f(x^t) \,|\, u^t = y^t] \le \mathbb{E}[f(y^t)] \le \mathbb{E}[f(z^t)].$$

The second inequality is due to the non-increasing property of Gradient-Focused-Optimizer. On the other hand, if $u^t = z^t$, we have

$$\mathbb{E}[f(x^t) \,|\, u^t = z^t] \le \mathbb{E}[f(z^t)] - h(\epsilon, \gamma).$$

This is due to property H.2 of Hessian-Focused-Optimizer. Combining the above two inequalities and using the law of total expectation, we get

$$\mathbb{E}[f(x^t)] \le \mathbb{E}[f(z^t)] - (1-p)\,h(\epsilon, \gamma) \le \mathbb{E}[f(x^{t-1})] - (1-p)\,h(\epsilon, \gamma). \qquad (7)$$

The second inequality is due to the non-increasing property of Gradient-Focused-Optimizer. Therefore, when the Hessian condition is violated, the objective function value always decreases by at least $(1-p)\,h(\epsilon, \gamma)$.

Case III: $\mathbb{E}\|\nabla f(z^t)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(z^t)) \ge -\gamma$ for some $t \in [T]$

This is the favorable case for the algorithm. The only point to note is that the objective function value is non-increasing in this case too. This is, again, due to the non-increasing properties of the subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer. In general, the more often this case occurs during the course of the algorithm, the higher the probability that the output of our algorithm satisfies the desired property.

The key observation is that Cases I & II cannot occur a large number of times, since each of these cases strictly decreases the objective function value. In particular, from Equations (6) and (7), it is easy to see that for each occurrence of Case I or II the following holds:

$$\mathbb{E}[f(x^t)] \le \mathbb{E}[f(x^{t-1})] - \theta,$$

where $\theta = \min\{p\,g(\epsilon)\,\epsilon^2,\; (1-p)\,h(\epsilon, \gamma)\}$. Furthermore, the function $f$ is lower bounded by $B$; thus, Cases I & II cannot occur more than $(f(x^0) - B)/\theta = \Delta/\theta$ times. Therefore, the probability of occurrence of Case III is at least $1 - \Delta/(T\theta)$, which completes the first part of the proof.

The second part of the proof simply follows from the first part. As seen above, the probability of Cases I & II is at most $\Delta/(T\theta)$. Therefore, the probability that an element of the set $S'$ falls in Case III is at least $1 - \Delta/(T\theta)$, which gives us the required result for the second part.

Appendix B Proof of Lemma 1

Proof.

The proof follows from the analysis in Reddi et al. (2016a) with some additional reasoning. We need to show two properties, G.1 and G.2, both of which concern the objective function value. To this end, we start with an update within the $(s+1)$-st epoch. We have the following:

$$\mathbb{E}[f(x_{t+1}^{s+1})] \le \mathbb{E}\Big[f(x_t^{s+1}) + \langle \nabla f(x_t^{s+1}),\, x_{t+1}^{s+1} - x_t^{s+1} \rangle + \frac{L}{2}\|x_{t+1}^{s+1} - x_t^{s+1}\|^2\Big] \le \mathbb{E}[f(x_t^{s+1})] - \eta_t\,\mathbb{E}\|\nabla f(x_t^{s+1})\|^2 + \frac{L\eta_t^2}{2}\,\mathbb{E}\|v_t^{s+1}\|^2. \qquad (8)$$

The first inequality is due to the $L$-smoothness of the function $f$. The second inequality simply follows from the unbiasedness of the Svrg update in Algorithm 2. For the analysis of the algorithm, we need the following Lyapunov function:

$$R_t^{s+1} := \mathbb{E}\big[f(x_t^{s+1}) + c_t\|x_t^{s+1} - \tilde{x}^s\|^2\big].$$

This function is a combination of the objective function and the distance of the current iterate from the latest snapshot $\tilde{x}^s$. Note that the distance term is introduced only for the analysis and is not part of the algorithm (see Algorithm 2). Here $c_t$ is chosen such that the following holds:

$$c_t = c_{t+1}\big(1 + \eta_t\beta_t + 2\eta_t^2 L^2\big) + \eta_t^2 L^3,$$

for all $t \in \{0, \dots, m-1\}$ and $c_m = 0$. For bounding the Lyapunov function $R_t^{s+1}$, we need the following bound on the distance of the current iterate from the latest snapshot:

$$\mathbb{E}\|x_{t+1}^{s+1} - \tilde{x}^s\|^2 \le \mathbb{E}\Big[\eta_t^2\|v_t^{s+1}\|^2 + \|x_t^{s+1} - \tilde{x}^s\|^2 + 2\eta_t\Big(\frac{1}{2\beta_t}\|\nabla f(x_t^{s+1})\|^2 + \frac{\beta_t}{2}\|x_t^{s+1} - \tilde{x}^s\|^2\Big)\Big]. \qquad (9)$$

The expansion underlying (9) uses the unbiasedness of the update of Svrg, and the last inequality follows from a simple application of the Cauchy-Schwarz and Young's inequalities. Substituting Equation (8) and Equation (9) into the Lyapunov function $R_{t+1}^{s+1}$, we obtain the following:

$$R_{t+1}^{s+1} \le \mathbb{E}[f(x_t^{s+1})] - \Big(\eta_t - \frac{c_{t+1}\eta_t}{\beta_t}\Big)\mathbb{E}\|\nabla f(x_t^{s+1})\|^2 + \Big(\frac{L\eta_t^2}{2} + c_{t+1}\eta_t^2\Big)\mathbb{E}\|v_t^{s+1}\|^2 + c_{t+1}(1 + \eta_t\beta_t)\,\mathbb{E}\|x_t^{s+1} - \tilde{x}^s\|^2. \qquad (10)$$

To further bound this quantity, we use Lemma 3 to bound $\mathbb{E}\|v_t^{s+1}\|^2$, so that upon substituting it in Equation (10), we see that

$$R_{t+1}^{s+1} \le R_t^{s+1} - \Gamma_t\,\mathbb{E}\|\nabla f(x_t^{s+1})\|^2.$$

The second inequality follows from the definitions of $c_t$ and $R_t^{s+1}$. Since $\eta_t = \eta$ and $\beta_t = \beta$ for all $t$,

$$R_{t+1}^{s+1} \le R_t^{s+1} - \gamma_n\,\mathbb{E}\|\nabla f(x_t^{s+1})\|^2, \qquad (11)$$

where

$$\gamma_n := \min_t \Gamma_t.$$

We will prove that $\gamma_n > 0$ for the given parameter setting (see the proof below). With $\gamma_n > 0$, it is easy to see that $R_{t+1}^{s+1} \le R_t^{s+1}$. Furthermore, note that $R_0^{s+1} = \mathbb{E}[f(x_0^{s+1})]$ since $x_0^{s+1} = \tilde{x}^s$ (see Algorithm 2). Also, we have

$$R_m^{s+1} = \mathbb{E}[f(x_m^{s+1})] = \mathbb{E}[f(\tilde{x}^{s+1})],$$

since $c_m = 0$, and thus we obtain $\mathbb{E}[f(\tilde{x}^{s+1})] \le \mathbb{E}[f(\tilde{x}^s)]$ for all $s$. Furthermore, using simple induction and the fact that this holds for every epoch $s$, it is easy to see that $\mathbb{E}[f(y)] \le f(x^0)$. Therefore, with the definition of $y$ specified in the output of Algorithm 2, we see that condition G.1 of Gradient-Focused-Optimizer is satisfied for the Svrg algorithm.

We now prove that $\gamma_n > 0$ and also that condition G.2 of Gradient-Focused-Optimizer is satisfied for the Svrg algorithm. By telescoping the sum over $t = 0, \dots, m-1$ in Equation (11), we obtain

$$\sum_{t=0}^{m-1} \mathbb{E}\|\nabla f(x_t^{s+1})\|^2 \le \frac{R_0^{s+1} - R_m^{s+1}}{\gamma_n}.$$

This inequality in turn implies that

$$\sum_{t=0}^{m-1} \mathbb{E}\|\nabla f(x_t^{s+1})\|^2 \le \frac{\mathbb{E}[f(\tilde{x}^s)] - \mathbb{E}[f(\tilde{x}^{s+1})]}{\gamma_n}, \qquad (12)$$

where we used that $R_m^{s+1} = \mathbb{E}[f(\tilde{x}^{s+1})]$ (since $c_m = 0$), and that $R_0^{s+1} = \mathbb{E}[f(\tilde{x}^s)]$ (since $x_0^{s+1} = \tilde{x}^s$). Now sum over all epochs to obtain

$$\frac{1}{T_s}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1} \mathbb{E}\|\nabla f(x_t^{s+1})\|^2 \le \frac{f(x^0) - \mathbb{E}[f(\tilde{x}^S)]}{T_s\,\gamma_n}, \qquad (13)$$

where $T_s = Sm$ is the total number of inner iterations. Here we used the fact that $\tilde{x}^0 = x^0$. To obtain a handle on $\gamma_n$ and complete our analysis, we will require an upper bound on $c_0$. We observe that

$$c_0 = \eta^2 L^3\,\frac{(1+\theta)^m - 1}{\theta},$$

where $\theta = 2\eta^2 L^2 + \eta\beta$. This is obtained by unrolling the relation $c_t = c_{t+1}(1 + \eta\beta + 2\eta^2 L^2) + \eta^2 L^3$ and using the fact that $c_m = 0$. Using the specified values of $\beta$ and $\eta$, we have

$$c_0 = \eta^2 L^3\,\frac{(1+\theta)^m - 1}{\theta} \le \eta^2 L^3\, m\,(1+\theta)^m.$$

Using the above bound on $c_0$, we get

$$c_0 \le e\,\eta^2 L^3\, m, \qquad (14)$$

wherein the second inequality follows upon noting that $(1 + 1/l)^l$ is increasing for $l > 0$ and is bounded above by $e$ (here $e$ is Euler's number). Now we can lower bound $\gamma_n$, as