
A Stochastic Trust Region Method for Non-convex Minimization

03/04/2019
by   Zebang Shen, et al.

We target the problem of finding a local minimum in non-convex finite-sum minimization. Towards this goal, we first prove that the trust region method with inexact gradient and Hessian estimation can achieve a convergence rate of order O(1/k^{2/3}) as long as those differential estimations are sufficiently accurate. Combining this result with a novel Hessian estimator, we propose the sample-efficient stochastic trust region (STR) algorithm, which finds an (ϵ, √ϵ)-approximate local minimum within O(√n/ϵ^{1.5}) stochastic Hessian oracle queries. This improves the state-of-the-art result by a factor of O(n^{1/6}). Experiments verify our theoretical conclusions and the efficiency of STR.


1 Introduction

We consider the following finite-sum minimization problem

min_{x∈ℝ^d} f(x) := (1/n) ∑_{i=1}^n f_i(x),   (1)

where each (non-convex) component function f_i is assumed to have an L-Lipschitz continuous gradient and a ρ-Lipschitz continuous Hessian. Since first-order stationary points could be saddle points and thus lead to inferior generalization performance (Dauphin et al., 2014), in this work we are particularly interested in computing (ϵ, √ϵ)-approximate second-order stationary points, (ϵ, √ϵ)-SOSP, i.e. points x satisfying

‖∇f(x)‖ ≤ ϵ   and   λ_min(∇²f(x)) ≥ −√ϵ.   (2)

To find a local minimum of problem (1), the cubic regularization approach (Nesterov and Polyak, 2006) and the trust region algorithm (Conn et al., 2000; Curtis et al., 2017) are two classical methods. Specifically, cubic regularization forms a cubic surrogate of the objective by adding a third-order regularization term to the second-order Taylor expansion, and minimizes it iteratively. Such a method is proved to achieve an O(1/k^{2/3}) global convergence rate and thus needs O(n/ϵ^{1.5}) stochastic first- and second-order oracle queries, namely evaluations of stochastic gradients and Hessians, to reach a point that satisfies (2). On the other hand, trust region algorithms also approximate the objective with its second-order Taylor expansion but minimize it only within a local region. Recently, Curtis et al. (2017) proposed a trust region variant that achieves the same convergence rate as the cubic regularization approach. However, both methods require computing full gradients and Hessians of the objective and thus suffer from high computational cost on large-scale problems.
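For concreteness, the two update rules can be written as follows (a sketch in standard notation; the symbols M for the cubic penalty parameter and r_k for the trust-region radius are ours and are not fixed by the text above):

```latex
% Cubic regularization step (Nesterov and Polyak, 2006):
s_k = \arg\min_{s \in \mathbb{R}^d}
      \langle \nabla f(x_k), s \rangle
      + \tfrac{1}{2} \langle s, \nabla^2 f(x_k)\, s \rangle
      + \tfrac{M}{6} \|s\|^3,
\qquad x_{k+1} = x_k + s_k.

% Trust region step (Conn et al., 2000):
s_k = \arg\min_{\|s\| \le r_k}
      \langle \nabla f(x_k), s \rangle
      + \tfrac{1}{2} \langle s, \nabla^2 f(x_k)\, s \rangle,
\qquad x_{k+1} = x_k + s_k.
```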

To avoid costly exact differential evaluations, many works exploit the finite-sum structure of problem (1) and develop stochastic cubic regularization approaches. Both Kohler and Lucchi (2017b) and Xu et al. (2017) propose to directly subsample the gradient and Hessian in the cubic surrogate function, obtaining stochastic first- and second-order oracle complexities that are independent of the sample size n. By plugging a stochastic variance-reduced gradient estimator (Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) into the gradient and Hessian estimation, the approach in Zhou et al. (2018a) improves both the stochastic first- and second-order oracle complexities to Õ(n^{4/5}/ϵ^{1.5}). Recently, Zhang et al. (2018) and Zhou et al. (2018b) developed more efficient stochastic cubic regularization variants, which further reduce the stochastic second-order oracle complexity to Õ(n^{2/3}/ϵ^{1.5}) at the cost of an increased stochastic first-order oracle complexity.

Contributions: In this paper we propose and exploit a formulation in which the step size of the trust region method is explicitly controlled. This idea is leveraged to develop two efficient stochastic trust region (STR) approaches. We tailor our methods to achieve state-of-the-art oracle complexities under the following two measurements: (i) the stochastic second-order oracle complexity is prioritized; (ii) the stochastic first- and second-order oracle complexities are treated equally. Specifically, in Setting (i), our Type I method employs a newly proposed estimator to approximate the Hessian and adopts the estimator in (Fang et al., 2018) for gradient approximation. Our novel Hessian estimator maintains a highly accurate second-order differential approximation with a lower amortized oracle complexity. In this way, STR achieves an O(√n/ϵ^{1.5}) stochastic second-order oracle complexity, which is lower than existing results for solving problem (1). In Setting (ii), our Type II method substitutes the gradient estimator of the Type I method with one that integrates stochastic gradients and Hessians together to maintain an accurate gradient approximation. As a result, it converges with fewer overall stochastic first- and second-order oracle queries than the best existing result.

1.1 Related Work

Computing a local minimum of a non-convex optimization problem has gained considerable attention in recent years. Both cubic regularization (CR) approaches (Nesterov and Polyak, 2006) and trust region (TR) algorithms (Conn et al., 2000; Curtis et al., 2017) can escape saddle points and find a local minimum by moving the iterate along directions related to the eigenvector of the Hessian with the most negative eigenvalue. As CR depends heavily on the regularization parameter of the cubic term, Cartis et al. (2011) propose an adaptive cubic regularization (ARC) approach that boosts efficiency by adaptively tuning the regularization parameter according to the current objective decrease. Noting the high cost of full gradient and Hessian computation in ARC, sub-sampled cubic regularization (SCR) (Kohler and Lucchi, 2017a) was developed, which samples a subset of data points to estimate the full gradient and Hessian. Recently, by exploiting the finite-sum structure of the target problem, many works incorporate the variance-reduction technique (Johnson and Zhang, 2013) into CR and propose stochastic variance-reduced methods. For example, Zhou et al. (2018c) propose stochastic variance-reduced cubic regularization (SVRC), which integrates the stochastic variance-reduced gradient estimator (Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) with CR; this method is proved to be provably faster than CR and TR. Zhou et al. (2018b) then suggest using an adaptive gradient batch size and a constant Hessian batch size, and develop Lite-SVRC to further reduce the stochastic second-order oracle complexity of SVRC at the cost of a higher gradient computation cost. Similarly, besides tuning the gradient batch size, Zhang et al. (2018) adaptively sample a certain number of data points to estimate the Hessian and prove that their method has the same stochastic second-order oracle complexity as Lite-SVRC.

Algorithm SFO SSO
TR
CR
SCR
SVRC
Lite-SVRC
STR
STR
Table 1: Stochastic first- and second-order oracle complexities (SFO and SSO for short, respectively) of the proposed STR approaches and other state-of-the-art methods. When SSO is prioritized, our STR has strictly better complexity than both SCR and Lite-SVRC. When SFO and SSO are treated equally, STR improves on the existing result of SVRC.

2 Preliminary

Notation. We use ‖v‖ to denote the Euclidean norm of a vector v and ‖A‖ to denote the spectral norm of a matrix A. Let [n] = {1, …, n} be the set of component indices. For an index set S ⊆ [n], we define the batch average of the component functions by f_S(x) := (1/|S|) ∑_{i∈S} f_i(x).

Then we specify the assumptions that are necessary for the analysis of our methods.

Assumption 2.1.

f is bounded from below and its global minimum f* is attained. We further denote Δ := f(x_0) − f*.

Assumption 2.2.

Each component function f_i has an L-Lipschitz continuous gradient: for any x, y ∈ ℝ^d,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.   (3)

Clearly, the objective f, as the average of the component functions, also has an L-Lipschitz continuous gradient.

Assumption 2.3.

Each component function f_i has a ρ-Lipschitz continuous Hessian: for any x, y ∈ ℝ^d,

‖∇²f_i(x) − ∇²f_i(y)‖ ≤ ρ‖x − y‖.   (4)

Similarly, the objective f has a ρ-Lipschitz continuous Hessian, which implies the following: for any x, y ∈ ℝ^d,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + ½⟨y − x, ∇²f(x)(y − x)⟩ + (ρ/6)‖y − x‖³.

2.1 Trust Region Method

The trust region method has a long history (Conn et al., 2000). In each step, it solves the Quadratically Constrained Quadratic Program (QCQP)

s_k = argmin_{‖s‖≤r} ⟨∇f(x_k), s⟩ + ½⟨s, ∇²f(x_k) s⟩,   (5)

where r is the trust-region radius, and updates

x_{k+1} = x_k + s_k.   (6)

Since ∇²f(x_k) can be indefinite, the trust-region subproblem (5) is non-convex, but its global optimizer can be characterized by the following lemma.

Lemma 2.1 (Corollary 7.2.2 in (Conn et al., 2000)).

Any global minimizer s_k of problem (5) satisfies the equation

(∇²f(x_k) + λ_k I) s_k = −∇f(x_k),   (7)

where the dual variable λ_k ≥ 0 should satisfy ∇²f(x_k) + λ_k I ⪰ 0 and λ_k (r − ‖s_k‖) = 0.

In particular, a standard QCQP solver returns both the minimizer s_k and the corresponding dual variable λ_k of subproblem (5). While the classical analysis of the trust-region update (5) and (6) yields a suboptimal convergence rate, Curtis et al. (2017) recently proposed a trust-region variant that converges at the optimal rate O(1/k^{2/3}) (Carmon et al., 2017). In this paper, we show that the vanilla trust-region update (5) and (6) already achieves the optimal convergence rate, as a byproduct of our novel argument.
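As an illustration of Lemma 2.1, the following is a minimal dense-matrix sketch of a subproblem solver that returns both the step and its dual variable. It bisects on the dual variable instead of using the Lanczos procedure employed in practice, ignores the degenerate "hard case", and all names are ours:

```python
import numpy as np

def solve_tr_subproblem(g, H, r):
    """Global solution of  min_{||s|| <= r}  <g, s> + 0.5 <s, H s>.

    Sketch of the characterization in Lemma 2.1:
    s = -(H + lam*I)^{-1} g  with  H + lam*I >= 0,  lam >= 0,  and
    lam * (r - ||s||) = 0.  Returns (s, lam).  The degenerate "hard
    case" (g orthogonal to the leading eigenvector) is not handled.
    """
    d = len(g)
    evals, Q = np.linalg.eigh(H)
    gq = Q.T @ g                               # g in the eigenbasis of H

    def step_norm(lam):
        return np.linalg.norm(gq / (evals + lam))

    lam_lo = max(0.0, -evals.min())            # smallest dual keeping H + lam*I PSD
    if step_norm(lam_lo + 1e-12) <= r:         # interior (or borderline) solution
        lam = lam_lo
        s = -np.linalg.solve(H + (lam + 1e-12) * np.eye(d), g)
        return s, lam
    # Boundary solution: ||s(lam)|| is decreasing in lam, so bisect until ||s(lam)|| = r.
    lam_hi = lam_lo + np.linalg.norm(g) / r + 1.0
    for _ in range(100):
        lam = 0.5 * (lam_lo + lam_hi)
        if step_norm(lam) > r:
            lam_lo = lam
        else:
            lam_hi = lam
    s = -np.linalg.solve(H + lam * np.eye(d), g)
    return s, lam
```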

0:  Initialization x_0, step size (trust-region radius) r, number of iterations T, construction of differential estimators g_t and H_t
1:  for t = 0 to T − 1 do
2:     Compute s_t and λ_t by solving (8);
3:     x_{t+1} = x_t + s_t;
4:     if λ_t is below the stopping threshold then
5:        Output x_{t+1};
6:     end if
7:  end for
MetaAlgorithm 1 Inexact Trust Region Method

3 Methodology

In this section, we first introduce a general inexact trust region method, summarized in MetaAlgorithm 1. It accepts an inexact gradient estimator g_t and an inexact Hessian estimator H_t as input to the QCQP subproblem

s_t = argmin_{‖s‖≤r} ⟨g_t, s⟩ + ½⟨s, H_t s⟩.   (8)

Similar to (5), the global solutions to (8) are characterized by Lemma 2.1, and we further denote the dual variable corresponding to the minimizer s_t by λ_t. In practice, (8) can be efficiently solved by the Lanczos method (Gould et al., 1999).
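A compact sketch of the overall loop of MetaAlgorithm 1 follows, reusing the solve_tr_subproblem routine sketched after Lemma 2.1; the stopping threshold lam_stop and the concrete radius are assumptions (Theorem 3.1 below sets both on the order of √ϵ), not values quoted from the paper:

```python
import numpy as np

def inexact_trust_region(x0, grad_est, hess_est, radius, lam_stop, max_iter):
    """Sketch of MetaAlgorithm 1 (inexact trust region method).

    grad_est(t, x) and hess_est(t, x) return the inexact estimators g_t
    and H_t (e.g. Estimators 4 and 3 below); `radius` is the fixed
    trust-region radius and `lam_stop` the stopping threshold on the
    dual variable.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g, H = grad_est(t, x), hess_est(t, x)
        s, lam = solve_tr_subproblem(g, H, radius)   # subproblem (8)
        x = x + s
        # A small dual variable certifies that x is already an
        # approximate second-order stationary point (see Theorem 3.1).
        if lam < lam_stop:
            break
    return x
```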

We prove that such an inexact trust-region method achieves the optimal convergence rate when the estimators g_t and H_t at each iteration are sufficiently close to their full (exact) counterparts ∇f(x_t) and ∇²f(x_t), respectively:

‖g_t − ∇f(x_t)‖ ≤ O(ϵ)   and   ‖H_t − ∇²f(x_t)‖ ≤ O(√ϵ).   (9)

Such a result allows us to derive stochastic trust-region variants with novel differential estimators that are tailored to ensure the optimal convergence rate. We state our formal result in Theorem 3.1.

Theorem 3.1.

Consider problem (1) under Assumptions 2.1–2.3. If the differential estimators g_t and H_t satisfy Eqn. (9) for all t, MetaAlgorithm 1 finds an (ϵ, √ϵ)-SOSP in at most O(Δ/ϵ^{1.5}) iterations by setting the trust-region radius r on the order of √ϵ.

Proof.

For simplicity of notation, we denote

From Assumption 2.3 we have

Use the Cauchy–Schwarz inequality to obtain

(10)

The requirement (9), together with the trust-region radius, allows us to bound

(11)

The optimality of (8) indicates that there exists a dual variable λ_t such that (Corollary 7.2.2 in (Conn et al., 2000))

First Order:       (H_t + λ_t I) s_t = −g_t,   (12)
Second Order:      H_t + λ_t I ⪰ 0,   (13)
Complementary:     λ_t (r − ‖s_t‖) = 0.   (14)

Multiplying (12) by s_t, we have

(15)

Additionally, using (13) we have

which together with (15) gives

(16)

Moreover, the complementary property (14) indicates ‖s_t‖ = r, since λ_t stays above the stopping threshold until MetaAlgorithm 1 terminates. Plug (11), (15), and (16) into (10) and use the choice of the trust-region radius r:

(17)

Therefore, if λ_t is above the stopping threshold, then

(18)

Using Assumption 2.1, we find such an iteration in no more than O(Δ/ϵ^{1.5}) iterations.

We now show that once the stopping criterion is triggered, x_{t+1} is already an (ϵ, √ϵ)-SOSP: from (12), we have

(19)

The accuracy requirements in (9), together with the trust region, imply

(20)

On the other hand, use Assumption 2.3 to bound

Combining these two results gives the desired gradient bound ‖∇f(x_{t+1})‖ ≤ ϵ.
Besides, using Assumption 2.3, requirement (9), and (13), we derive the Hessian lower bound

Hence x_{t+1} is an (ϵ, √ϵ)-approximate second-order stationary point.

Therefore, we have ‖s_t‖ = r according to the complementary condition (14) for all but the last iteration. ∎

Remark 3.1.

We emphasize that MetaAlgorithm 1 degenerates to the exact trust region method by taking g_t = ∇f(x_t) and H_t = ∇²f(x_t). Such a result is of independent interest because it gives the first proof that the vanilla trust region method attains the optimal convergence rate. A similar rate is achieved by (Curtis et al., 2017), but with a more complicated trust region variant.

Remark 3.2.

We note that MetaAlgorithm 1 uses the dual variable λ_t as the stopping criterion, which enables the last-iterate convergence analysis in Theorem 3.1. In the appendix, we present a variant of MetaAlgorithm 1 that does not require access to the exact dual variable. Further, we show that this variant enjoys a similar convergence guarantee in expectation.

Theorem 3.1 highlights the explicit step-size control of the trust-region method: since the dual variable λ_t is above the stopping threshold for all but the last iteration, the solution to the trust-region subproblem (8) always lies on the boundary, i.e. ‖s_t‖ = r, according to the complementary condition (14). Such an exact step-size control property is missing in the cubic-regularization method, where the step size is implicitly determined by the cubic regularization parameter.

More importantly, we emphasize that such explicit step-size control is crucial to the sample efficiency of our variance-reduced differential estimators. The essence of variance reduction is to exploit the correlations between the differentials in consecutive iterations. Intuitively, when two neighboring iterates are close, so are their differentials due to Lipschitz continuity, and hence a small number of samples suffices to maintain the accuracy of the estimators. On the other hand, a smaller step size reduces the per-iteration objective decrease, which harms the convergence rate of the algorithm (see the proof of Theorem 3.1). Therefore, the explicit step-size control in the trust-region method allows us to trade off the per-iteration sample complexity against the convergence rate, from which we derive stochastic trust region approaches with state-of-the-art sample efficiency.

4 Stochastic Trust Region Method: Type I

0:  Initializer x_0, step size (trust-region radius) r, number of iterations T
1:  for t = 0 to T − 1 do
2:     Construct gradient estimator g_t by Estimator 4;
3:     Construct Hessian estimator H_t by Estimator 3;
4:     Compute s_t and λ_t by solving (8);
5:     x_{t+1} = x_t + s_t;
6:     if λ_t is below the stopping threshold then
7:        Output x_{t+1};
8:     end if
9:  end for
Algorithm 2 STR

Having the inexact trust-region method as a prototype, we now derive our first sample-efficient stochastic trust region method, the Type I variant of STR, which emphasizes a cheaper stochastic second-order oracle complexity. As Theorem 3.1 already guarantees the optimal convergence rate of MetaAlgorithm 1 when the gradient estimator g_t and the Hessian estimator H_t meet requirement (9), here we focus on constructing such novel differential estimators. Specifically, we first present our Hessian estimator in Estimator 3 and our first gradient estimator in Estimator 4, both of which exploit the trust-region radius to reduce their variance. By plugging Estimator 3 and Estimator 4 into MetaAlgorithm 1, we obtain STR in Algorithm 2 with state-of-the-art stochastic Hessian complexity.

4.1 Hessian Estimator

0:  Epoch length, sample size, (optional)
1:  if mod() then
2:     Option I:     high accuracy case (small )  ;
3:     Option II:    low accuracy case (moderate ) Draw samples indexed by ; ;
4:  else
5:     Draw samples indexed by ;
6:     ;
7:  end if
Estimator 3 Hessian Estimator

Our epoch-wise Hessian estimator is given in Estimator 3, where one parameter controls the epoch length and another (optionally a second one) controls the minibatch size. At the beginning of each epoch, Estimator 3 has two options, designed for different target accuracies: Option I is preferable in the high accuracy case (small ϵ), where we compute the full Hessian to avoid approximation error, and Option II is designed for the moderate accuracy case (moderate ϵ), where an approximate Hessian snapshot suffices. Then iterations follow, with H_t defined in a recurrent manner. Such recurrent estimators exist for the first-order case (Nguyen et al., 2017; Fang et al., 2018), but their bounds only hold under the vector norm. Here we generalize them to Hessian estimation with a matrix spectral-norm bound.
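The following Python sketch illustrates the construction just described; the parameter names and the uniform sampling scheme are illustrative assumptions rather than the paper's exact specification:

```python
import numpy as np

def hessian_estimator(t, x_t, x_prev, H_prev, hess_i, n, epoch_len, batch,
                      snapshot_batch=None, rng=np.random.default_rng(0)):
    """Sketch of the epoch-wise variance-reduced Hessian estimator (Estimator 3).

    hess_i(i, x) returns the component Hessian of f_i at x.  The names
    epoch_len, batch, snapshot_batch are placeholders; the paper fixes their
    values as functions of n and the accuracy eps (Lemma 4.1).
    """
    if t % epoch_len == 0:
        if snapshot_batch is None:          # Option I: exact snapshot (high accuracy)
            return np.mean([hess_i(i, x_t) for i in range(n)], axis=0)
        idx = rng.choice(n, size=snapshot_batch, replace=True)   # Option II: subsampled snapshot
        return np.mean([hess_i(i, x_t) for i in idx], axis=0)
    # Inside an epoch: recurrent (SPIDER-style) correction.  Because the trust
    # region keeps x_t close to x_{t-1}, each sampled Hessian difference has a
    # small spectral norm, so a small batch keeps the estimator accurate.
    idx = rng.choice(n, size=batch, replace=True)
    diff = np.mean([hess_i(i, x_t) - hess_i(i, x_prev) for i in idx], axis=0)
    return H_prev + diff
```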

The following lemma analyzes the amortized stochastic second-order oracle (Hessian) complexity required for Estimator 3 to meet the requirement in Theorem 3.1. As we need an approximation error bound under the spectral norm, we appeal to the matrix Azuma inequality (Tropp, 2012).

Lemma 4.1.

Assume Algorithm 2 takes the trust-region radius as in Theorem 3.1. For any t, Estimator 3 produces estimators H_t for the second-order differentials such that

with probability at least

if we set
1. and in option I, or
2. , , and in option II .
Consequently the amortized per-iteration stochastic second-order oracle complexity to construct is no more than

Proof.

Without loss of generality, we analyze the first epoch for ease of notation. We first focus on Option II; the proof for Option I follows a similar argument.
Option II: Define for and

and define for and

is a martingale difference. We have for all and ,

Besides, using Assumption 2.2 for to bound

(21)

and using Assumption 2.3 for to bound

From the construction of , we have

Thus using the matrix Azuma’s Inequality in Theorem 7.1 of (Tropp, 2012) and , we have

Consequently, we have

by taking , , , and .

Option I: The proof is similar to the one of Option II except that we replace

with zero matrix. In such case, the matrix Azuma’s Inequality implies

Thus by taking , , and , we have the result.

Amortized Complexity: In option I, the choice of parameter ensures that: and in option II: . Consequently the amortized stochastic second-order oracle is bounded by . ∎

4.2 Gradient Estimator: Case (1)

1:  if mod() then
2:     
3:  else
4:     Draw samples indexed by ;
5:     ;
6:  end if
Estimator 4 Gradient Estimator: Case (1)

When the stochastic second-order oracle complexity is prioritized, we directly employ the SPIDER gradient estimator (Fang et al., 2018) to construct g_t. Similar to the construction of H_t, the estimator is constructed in an epoch-wise manner, as presented in Estimator 4, where one parameter controls the epoch length and another controls the minibatch size.
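A sketch of the SPIDER-style recurrence of Estimator 4; again the parameter names and sampling details are illustrative assumptions:

```python
import numpy as np

def spider_gradient(t, x_t, x_prev, g_prev, grad_i, n, epoch_len, batch,
                    rng=np.random.default_rng(0)):
    """Sketch of Estimator 4, the SPIDER gradient estimator (Fang et al., 2018).

    grad_i(i, x) returns the component gradient of f_i at x.  The epoch
    length and batch size are placeholders; Lemma 4.2 specifies them as
    functions of n and the accuracy eps.
    """
    if t % epoch_len == 0:
        # Epoch head: recompute the exact full gradient.
        return np.mean([grad_i(i, x_t) for i in range(n)], axis=0)
    # Inside an epoch: variance-reduced recurrence over a small minibatch.
    idx = rng.choice(n, size=batch, replace=True)
    diff = np.mean([grad_i(i, x_t) - grad_i(i, x_prev) for i in idx], axis=0)
    return g_prev + diff
```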

We now analyze the stochastic first-order oracle complexity necessary to meet the requirement in Theorem 3.1.

Lemma 4.2.

Assume Algorithm 2 takes the trust-region radius as in Theorem 3.1. Estimator 4 produces an estimator g_t of the first-order differential such that the gradient requirement in (9) holds with high probability for any t, provided the epoch length and minibatch size are set suitably. Consequently, the amortized per-iteration stochastic first-order oracle complexity to construct g_t is bounded.

The proof of Lemma 4.2 is similar to that of Lemma 4.1 and is deferred to Appendix 7.1.
Lemma 4.2 and Lemma 4.1 only guarantee that the differential estimators satisfy requirement (9) in a single iteration; this can be extended to hold for all t by using the union bound with a suitably rescaled failure probability. Combining the lifted result with Theorem 3.1, we establish the following computational complexity bound.

Corollary 4.1.

Assume Algorithm 2 uses Estimator 4 to construct the first-order differential estimator g_t and Estimator 3 to construct the second-order differential estimator H_t. To find an (ϵ, √ϵ)-SOSP with probability at least 1 − δ, the overall stochastic second-order oracle complexity is O(√n/ϵ^{1.5}), with the corresponding overall stochastic first-order oracle complexity given by Lemma 4.2.

From Corollary 4.1, we see that O(√n/ϵ^{1.5}) stochastic second-order oracle queries are sufficient for STR to find an (ϵ, √ϵ)-SOSP, which is significantly better than both the subsampled cubic regularization method (Kohler and Lucchi, 2017a) and the variance-reduction based ones (Zhou et al., 2018b; Zhang et al., 2018).

5 Stochastic Trust Region Method: Type II

In the previous section, we focused on the setting where the stochastic second-order oracle complexity is prioritized over the stochastic first-order oracle complexity, and STR achieves state-of-the-art efficiency. In this section, we consider a different complexity measure in which the first-order and second-order oracle complexities are treated equally, and our goal is to minimize the maximum of the two. We note that the best existing result under this measure is that of the SVRC method (Zhou et al., 2018c).

Since the Hessian estimator of STR already delivers a superior stochastic Hessian complexity, in the Type II variant (see Algorithm 5) we retain Estimator 3 for second-order differential estimation and use Estimator 6 to further reduce the stochastic gradient complexity.

0:  Initializer x_0, step size (trust-region radius) r, number of iterations T
1:  for t = 0 to T − 1 do
2:     Construct gradient estimator g_t by Estimator 6;
3:     Construct Hessian estimator H_t by Estimator 3;
4:     Compute s_t and λ_t by solving (8);
5:     x_{t+1} = x_t + s_t;
6:     if λ_t is below the stopping threshold then
7:        Output x_{t+1};
8:     end if
9:  end for
Algorithm 5 STR
1:  if mod() then
2:     Let
3:     
4:  else
5:     Draw samples indexed by ;
6:     ;
7:     ;
8:  end if
Estimator 6 Gradient Estimator: Case (2)
Figure 1: Comparison on logistic regression with a nonconvex regularizer. Panels: (a) a9a, (b) ijcnn, (c) codrna, (d) phishing.

Figure 2: Comparison on the nonlinear least squares problem. Panels: (a) a9a, (b) w8a, (c) codrna.

5.1 Gradient Estimator: Case (2)

When the maximum of the stochastic gradient and Hessian complexities is prioritized, we use Hessian information to improve the gradient estimation of Estimator 4. Intuitively, the Mean Value Theorem gives

∇f_i(x_t) − ∇f_i(x_{t−1}) = ∇²f_i(ξ_i)(x_t − x_{t−1})

with ξ_i = x_{t−1} + θ_i(x_t − x_{t−1}) for some θ_i ∈ [0, 1], which under Assumption 2.3 allows us to bound the deviation of ∇²f_i(ξ_i) from the Hessian at a nearby reference point by ρ times the distance between the two points.

Such a property can be used to improve Lemma 4.2 of Estimator 4. Specifically, define the correction term ∇²f_i(x̃)(x_t − x_{t−1}), where x̃ is a reference point updated in an epoch-wise manner. Estimator 6 adds this correction to the estimator in Estimator 4. Note that in Estimator 6, the numbers of first-order and second-order oracle queries are the same.
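The following sketch shows one plausible way to combine such a correction term with the Estimator 4 recurrence; the exact recursion used by Estimator 6 may differ, so treat this purely as an illustration of the idea (all names are ours):

```python
import numpy as np

def corrected_gradient(t, x_t, x_prev, g_prev, x_ref, H_ref, grad_i, hess_i,
                       n, epoch_len, batch, rng=np.random.default_rng(0)):
    """A plausible instantiation of a Hessian-corrected gradient estimator
    in the spirit of Estimator 6; not the paper's exact construction.

    x_ref is the epoch-wise reference point and H_ref an estimate of the full
    Hessian at x_ref; grad_i / hess_i return component gradients / Hessians.
    """
    if t % epoch_len == 0:
        # Epoch head: exact gradient at the new reference point.
        return np.mean([grad_i(i, x_t) for i in range(n)], axis=0)
    d = x_t - x_prev
    idx = rng.choice(n, size=batch, replace=True)
    # Each sampled gradient difference is corrected by the component Hessian
    # at the reference point; adding back H_ref @ d keeps the estimator
    # (approximately) unbiased, while the residual variance now scales with
    # the Hessian-Lipschitz constant rho instead of the gradient-Lipschitz L.
    diff = np.mean([grad_i(i, x_t) - grad_i(i, x_prev) - hess_i(i, x_ref) @ d
                    for i in idx], axis=0)
    return g_prev + diff + H_ref @ d
```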

We now analyze the necessary first-order (and second-order) oracle complexity to meet requirement (9).

Lemma 5.1.

Assume Algorithm 5 takes the trust-region radius as in Theorem 3.1. For any t, Estimator 6 produces an estimator g_t of the first-order differential such that the gradient requirement in (9) holds with high probability, given suitable choices of the epoch length and minibatch size. Consequently, the amortized per-iteration stochastic first-order oracle complexity to construct g_t is bounded.

The proof of Lemma 5.1 is similar to that of Lemma 4.1 and is deferred to Appendix 7.2.

As in the previous section, Lemma 5.1 only guarantees that the gradient estimator satisfies requirement (9) in a single iteration. We extend this result to hold for all t by using the union bound, which together with Theorem 3.1 gives the following corollary.

Corollary 5.1.

Assume Algorithm 5 uses Estimator 6 to construct the first-order differential estimator g_t and Estimator 3 to construct the second-order differential estimator H_t. To find an (ϵ, √ϵ)-SOSP with probability at least 1 − δ, the overall stochastic first-order and second-order oracle complexities are of the same order.

Corollary 5.1 shows that, to find an (ϵ, √ϵ)-SOSP of problem (1), the stochastic first- and second-order oracle complexities of the Type II variant match each other and improve on the best existing result in (Zhou et al., 2018c).

6 Experiments

In this section, we compare the proposed STR with several state-of-the-art (stochastic) cubic regularization and trust region approaches, including the trust region (TR) algorithm (Conn et al., 2000), adaptive cubic regularization (ARC) (Cartis et al., 2011), sub-sampled cubic regularization (SCR) (Kohler and Lucchi, 2017a), stochastic variance-reduced cubic regularization (SVRC) (Zhou et al., 2018c), and Lite-SVRC (Zhou et al., 2018b). For STR, we estimate the gradient as in Case (1), because this variant enjoys a lower Hessian computational complexity than Case (2), and for most problems computing Hessian matrices is much more time-consuming than computing gradients. For the subproblems in all compared methods, we use the Lanczos method (Gould et al., 1999; Kohler and Lucchi, 2017a) to solve the subproblem approximately in a Hessian-related Krylov subspace. We run simulations on five datasets from LibSVM (a9a, ijcnn, codrna, phishing, and w8a). The details of these datasets are described in Appendix 9. For all the considered algorithms, we tune their hyper-parameters optimally.

Two nonconvex evaluation problems. Following (Kohler and Lucchi, 2017a; Zhou et al., 2018c), we evaluate all considered algorithms on two learning tasks: logistic regression with a nonconvex regularizer and nonlinear least squares. Given data points consisting of feature vectors and binary labels, logistic regression with a nonconvex regularizer aims at distinguishing the two classes of samples by minimizing the regularized logistic loss, where the regularizer is a nonconvex penalty on the model parameters. The nonlinear least squares problem fits the nonlinear data by minimizing the squared error of a nonlinear prediction. For both problems, we use the same parameter settings across all testing datasets.
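For reference, a sketch of the two objectives in the commonly used forms from the cited works; the exact regularizer and parameter values of this paper are assumptions here, not quoted:

```python
import numpy as np

def nonconvex_logistic_loss(x, A, y, alpha=0.1):
    """Logistic loss plus the nonconvex regularizer sum_j alpha * x_j^2 / (1 + x_j^2),
    the standard formulation from Kohler and Lucchi (2017a); assumed, not quoted.
    A is the (n, d) sample matrix and y the labels in {-1, +1}."""
    margins = y * (A @ x)
    loss = np.mean(np.logaddexp(0.0, -margins))      # numerically stable log(1 + e^{-m})
    reg = alpha * np.sum(x ** 2 / (1.0 + x ** 2))
    return loss + reg

def nonlinear_least_squares_loss(x, A, y):
    """Nonlinear least squares with a sigmoid link, fitting labels y in [0, 1];
    again a commonly used form assumed here for illustration."""
    preds = 1.0 / (1.0 + np.exp(-(A @ x)))
    return np.mean((y - preds) ** 2)
```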

Figure 1 summarizes the results on the nonconvex logistic regression problems. For each dataset, we report the function value gap versus the overall running time, which reflects the overall computational complexity of an algorithm, and the function value gap versus the Hessian sample complexity, which reveals the cost of Hessian computation. From Fig. 1, one can observe that the proposed STR runs faster than the compared algorithms in terms of running time, showing its overall superiority. Furthermore, STR also exhibits much sharper convergence curves in terms of Hessian sample complexity, which is consistent with our theory: to achieve an (ϵ, √ϵ)-accurate local minimum, the Hessian sample complexity of STR is O(√n/ϵ^{1.5}), which is superior to the complexities of the compared methods (see the comparison in Sec. 4.2). This also explains why our algorithm is faster in terms of running time, since for most optimization problems the Hessian is much more computationally expensive than the gradient, so a lower Hessian sample complexity translates into a faster overall convergence speed.

Figure 2 displays the results of the compared algorithms on the nonlinear least squares problems. STR shows very similar behavior to that in Figure 1. More specifically, STR achieves the fastest convergence in terms of both running time and Hessian sample complexity. On the codrna dataset (the bottom of Figure 2), we further plot convergence against running time and Hessian sample complexity, and one can observe that the gradient in STR vanishes significantly faster than in the other algorithms, which means that STR finds a stationary point with high efficiency. See Figure 3 in Appendix 9.2 for more experimental results on running time. All these results confirm the superiority of the proposed STR.

Conclusion

We proposed two stochastic trust region variants. Under two efficiency measurement settings (depending on whether the stochastic first- and second-order oracle complexities are treated equally), the proposed methods achieve state-of-the-art oracle complexities. Experimental results corroborate our theoretical findings and the efficiency of the proposed algorithms.

References

7 Appendix

7.1 Proof of Lemma 4.2

Without loss of generality, we analyze the case for ease of notation. Define for and

is a martingale difference: for all and

Besides, it has bounded norm: using Assumption 2.3,

(22)

From the construction of , we have

Recall the Azuma’s Inequality. Using , we have

Take and denote . To ensure that