# Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization

In this paper we study stochastic quasi-Newton methods for nonconvex stochastic optimization, where we assume that noisy information about the gradients of the objective function is available via a stochastic first-order oracle (SFO). We propose a general framework for such methods, for which we prove almost sure convergence to stationary points and analyze its worst-case iteration complexity. When a randomly chosen iterate is returned as the output of such an algorithm, we prove that in the worst-case, the SFO-calls complexity is O(ϵ^-2) to ensure that the expectation of the squared norm of the gradient is smaller than the given accuracy tolerance ϵ. We also propose a specific algorithm, namely a stochastic damped L-BFGS (SdLBFGS) method, that falls under the proposed framework. Moreover, we incorporate the SVRG variance reduction technique into the proposed SdLBFGS method, and analyze its SFO-calls complexity. Numerical results on a nonconvex binary classification problem using SVM, and a multiclass classification problem using neural networks are reported.



## Code Repositories

### minSQN

Optimization using Stochastic quasi-Newton methods


## 1 Introduction

In this paper, we consider the following stochastic optimization problem:

 min_{x ∈ R^n} f(x) = E[F(x, ξ)],   (1.1)

where F(·, ξ): R^n → R is continuously differentiable and possibly nonconvex, ξ denotes a random variable with distribution function P, and E[·] denotes the expectation taken with respect to ξ. In many cases the function F is not given explicitly and/or the distribution function P is unknown, or the function values and gradients of f cannot be easily obtained and only noisy information about the gradient of f is available. In this paper we assume that noisy gradients of f can be obtained via calls to a stochastic first-order oracle (SFO). Problem (1.1) arises in many applications in statistics and machine learning [36, 52], in mixed logit modeling problems in economics and transportation [7, 4, 26], as well as in many other areas. A special case of (1.1) that arises frequently in machine learning is the empirical risk minimization problem

 min_{x ∈ R^n} f(x) = (1/T) ∑_{i=1}^{T} f_i(x),   (1.2)

where f_i is the loss function corresponding to the i-th data sample, and T denotes the number of data samples, which is assumed to be extremely large.
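For concreteness, the finite-sum form (1.2) can be instantiated as, for example, a logistic loss averaged over T samples. The data, dimensions, and function names below are illustrative, not taken from the paper:

```python
import numpy as np

# Synthetic instance of the empirical-risk form (1.2): logistic loss over T samples.
rng = np.random.default_rng(0)
T, n = 1000, 5
X = rng.standard_normal((T, n))          # feature vectors a_i as rows
y = rng.choice([-1.0, 1.0], size=T)      # binary labels

def f(x):
    # f(x) = (1/T) * sum_i log(1 + exp(-y_i * <a_i, x>))
    return np.mean(np.log1p(np.exp(-y * (X @ x))))

def grad_f(x):
    # d/dz log(1 + exp(-y z)) = -y / (1 + exp(y z))
    s = -y / (1.0 + np.exp(y * (X @ x)))
    return (X.T @ s) / T
```

Each f_i here is smooth but the sum need not be convex in general models; the logistic case is simply a convenient smooth example.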

The idea of employing stochastic approximation (SA) to solve stochastic programming problems can be traced back to the seminal work of Robbins and Monro [47]. The classical SA method, also referred to as stochastic gradient descent (SGD), mimics the steepest gradient descent method: it updates the iterate via x_{k+1} = x_k − α_k g(x_k, ξ_k), where the stochastic gradient g(x_k, ξ_k) is an unbiased estimate of the gradient ∇f(x_k) of f at x_k, and α_k is the stepsize. The SA method has been studied extensively in [12, 17, 19, 44, 45, 49, 50], where the main focus has been the convergence of SA in different settings. Recently, there has been a lot of interest in analyzing the worst-case complexity of SA methods, stimulated by the complexity theory developed by Nesterov for first-order methods for solving convex optimization problems [42, 43]. Nemirovski et al. [41] proposed a mirror descent SA method for solving a nonsmooth convex stochastic programming problem over a convex set, and proved that for any given ϵ > 0, the method needs O(ϵ^{-2}) iterations to obtain a solution whose expected objective gap is at most ϵ. Other SA methods with provable complexities for solving convex stochastic optimization problems have also been studied in [20, 28, 29, 30, 31, 32, 3, 14, 2, 53, 56].

Recently there has also been considerable interest in SA methods for the stochastic optimization problem (1.1) in which f is a nonconvex function. In [6], an SA method to minimize a general cost function was proposed by Bottou and proved to converge to stationary points. Ghadimi and Lan [21] proposed a randomized stochastic gradient (RSG) method that returns an iterate from a randomly chosen iteration as an approximate solution. It is shown in [21] that to return a solution x̄ such that E[‖∇f(x̄)‖²] ≤ ϵ, where ‖·‖ denotes the Euclidean norm, the total number of SFO-calls needed by RSG is O(ϵ^{-2}). Ghadimi and Lan [22] also studied an accelerated SA method for solving (1.1) based on Nesterov's accelerated gradient method [42, 43], which improves the SFO-call complexity in the convex case. In [23], Ghadimi, Lan and Zhang proposed a mini-batch SA method for solving problems in which the objective function is a composition of a nonconvex smooth function and a convex nonsmooth function, and analyzed its worst-case SFO-call complexity. In [13], a method that incorporates a block-coordinate decomposition scheme into stochastic mirror-descent methodology was proposed by Dang and Lan for nonconvex stochastic optimization over a convex set with a block structure. More recently, Wang, Ma and Yuan [55] proposed a penalty method for nonconvex stochastic optimization problems with nonconvex constraints, and analyzed its SFO-call complexity.

In this paper, we study stochastic quasi-Newton (SQN) methods for solving the nonconvex stochastic optimization problem (1.1). In the deterministic optimization setting, quasi-Newton methods are more robust and achieve higher accuracy than gradient methods, because they exploit approximate second-order derivative information. Quasi-Newton methods typically take steps of the form

 x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k),   or   x_{k+1} = x_k − α_k H_k ∇f(x_k),   (1.3)

where B_k is an approximation to the Hessian matrix ∇²f(x_k), or H_k is an approximation to [∇²f(x_k)]^{-1}. The most widely used quasi-Newton method, the BFGS method [8, 18, 24, 54], updates B_k via

 B_k = B_{k−1} + (y_{k−1} y_{k−1}^⊤)/(s_{k−1}^⊤ y_{k−1}) − (B_{k−1} s_{k−1} s_{k−1}^⊤ B_{k−1})/(s_{k−1}^⊤ B_{k−1} s_{k−1}),   (1.4)

where s_{k−1} = x_k − x_{k−1} and y_{k−1} = ∇f(x_k) − ∇f(x_{k−1}). By using the Sherman-Morrison-Woodbury formula, it is easy to derive that the equivalent update to H_k = B_k^{-1} is

 H_k = (I − ρ_{k−1} s_{k−1} y_{k−1}^⊤) H_{k−1} (I − ρ_{k−1} y_{k−1} s_{k−1}^⊤) + ρ_{k−1} s_{k−1} s_{k−1}^⊤,   (1.5)

where ρ_{k−1} = 1/(s_{k−1}^⊤ y_{k−1}). For stochastic optimization, there has been some work in designing stochastic quasi-Newton methods that update the iterates via (1.3) using a stochastic gradient in place of ∇f(x_k). Specific examples include the following. The adaptive subgradient (AdaGrad) method proposed by Duchi, Hazan and Singer [15], which takes B_k to be a diagonal matrix that estimates the diagonal of the square root of the uncentered covariance matrix of the gradients, has been proven to be quite efficient in practice. In [5], Bordes, Bottou and Gallinari studied SGD with a diagonal rescaling matrix based on the secant condition associated with quasi-Newton methods. Roux and Fitzgibbon [48] discussed the necessity of including both Hessian and covariance matrix information in a stochastic Newton-type method. Byrd et al. [9] proposed a quasi-Newton method that uses the sample average approximation (SAA) approach to estimate Hessian-vector products. In [10], Byrd et al. proposed a stochastic limited-memory BFGS (L-BFGS) [34] method based on SA, and proved its convergence for strongly convex problems. Stochastic BFGS and L-BFGS methods were also studied for online convex optimization by Schraudolph, Yu and Günter in [51]. For strongly convex problems, Mokhtari and Ribeiro proposed a regularized stochastic BFGS method (RES) and analyzed its convergence in [38], and studied an online L-BFGS method in [39]. Recently, Moritz, Nishihara and Jordan [40] proposed a linearly convergent method that integrates the L-BFGS method in [10] with the variance reduction technique (SVRG) proposed by Johnson and Zhang in [27] to alleviate the effect of noisy gradients. A related method that incorporates SVRG into a quasi-Newton method was studied by Lucchi, McWilliams and Hofmann in [35]. In [25], Gower, Goldfarb and Richtárik proposed a variance-reduced block L-BFGS method that converges linearly for convex functions. It should be noted that all of the above stochastic quasi-Newton methods are designed for solving convex or even strongly convex problems.
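The update (1.5) is what makes limited-memory variants practical: the product H_k g can be computed from a small number of stored pairs (s_i, y_i) by the standard L-BFGS two-loop recursion, without ever forming H_k. A minimal sketch (the initial scaling H_k^0 = (s^⊤y / y^⊤y) I and the memory handling are common conventions, not details taken from this paper):

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """Standard L-BFGS two-loop recursion: returns H_k @ g, where H_k is
    obtained by applying the update (1.5) over the stored pairs (s_i, y_i),
    without forming H_k explicitly."""
    q = g.copy()
    rhos = [1.0 / (s @ yv) for s, yv in zip(s_list, y_list)]
    alphas = []
    for s, yv, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * yv
    if s_list:  # common initial scaling H_k^0 = (s'y / y'y) I from the newest pair
        s, yv = s_list[-1], y_list[-1]
        q *= (s @ yv) / (yv @ yv)
    for s, yv, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * (yv @ q)
        q += (a - b) * s
    return q
```

With a single stored pair this reproduces one application of (1.5) exactly; with p pairs the cost is O(pn) per step.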

Challenges. The key challenge in designing stochastic quasi-Newton methods for nonconvex problems lies in preserving the positive-definiteness of B_k (and H_k), due to the nonconvexity of the problem and the presence of noise in the gradient estimates. It is known that the BFGS update (1.4) preserves the positive-definiteness of B_k as long as the curvature condition

 s_{k−1}^⊤ y_{k−1} > 0   (1.6)

holds, which is guaranteed for strongly convex problems. For nonconvex problems, the curvature condition (1.6) can be enforced by performing a line search. However, doing so is no longer feasible for (1.1) in the stochastic setting, because exact function values and gradient information are not available. As a result, an important issue in designing stochastic quasi-Newton methods for nonconvex problems is how to preserve the positive-definiteness of B_k (or H_k) without a line search.
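One standard line-search-free remedy, in the spirit of the damped update developed later in Section 3, is Powell's damping: replace y_{k−1} by a convex combination of y_{k−1} and B_{k−1}s_{k−1} so that a damped curvature condition always holds. A hedged sketch (the threshold 0.2 is Powell's conventional choice, not necessarily the constant used in this paper):

```python
import numpy as np

def powell_damp(s, y, B, c=0.2):
    """Powell-style damping: return y_bar = theta*y + (1-theta)*B@s with theta
    chosen so that s' y_bar >= c * s' B s > 0.  Using y_bar in the BFGS update
    (1.4) keeps B_k positive definite even when s'y > 0 fails."""
    sBs = s @ (B @ s)
    sy = s @ y
    if sy >= c * sBs:
        theta = 1.0                           # no damping needed
    else:
        theta = (1.0 - c) * sBs / (sBs - sy)  # smallest damping that restores curvature
    return theta * y + (1.0 - theta) * (B @ s)
```

When the raw pair already satisfies the curvature condition with margin c, the update is unchanged; otherwise y is pulled toward B s just enough to restore it.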

Our contributions. Our contributions (and where they appear) in this paper are as follows.

• We propose a general framework for stochastic quasi-Newton (SQN) methods for solving the nonconvex stochastic optimization problem (1.1), and prove its almost sure convergence to a stationary point when the step size α_k is diminishing. We also prove that the number of iterations N needed to obtain (1/N) ∑_{k=1}^{N} E[‖∇f(x_k)‖²] ≤ ϵ is N = O(ϵ^{−1/(1−β)}), for α_k chosen proportional to k^{−β}, where β ∈ (1/2, 1) is a constant. (See Section 2)

• When a randomly chosen iterate x_R is returned as the output of SQN, we prove that the worst-case SFO-calls complexity is O(ϵ^{-2}) to guarantee E[‖∇f(x_R)‖²] ≤ ϵ. (See Section 2.2)

• We propose a stochastic damped L-BFGS (SdLBFGS) method that fits into the proposed framework. This method adaptively generates a positive definite matrix H_k that approximates the inverse Hessian matrix at the current iterate x_k. Convergence and complexity results for this method are provided. Moreover, our method does not generate H_k explicitly; only its product with vectors is computed directly. (See Section 3)

• Motivated by recent advances in SVRG for nonconvex minimization [46, 1], we propose a variance-reduced variant of SdLBFGS and analyze its SFO-calls complexity. (See Section 4)

## 2 A general framework for stochastic quasi-Newton methods for nonconvex optimization

In this section, we study SQN methods for the (possibly nonconvex) stochastic optimization problem (1.1). We assume that an SFO outputs a stochastic gradient g(x, ξ) of f for a given x, where ξ is a random variable whose distribution is supported on a set Ξ. Here we assume that Ξ does not depend on x.

We now give some assumptions that are required throughout this paper.

• AS.1: f is continuously differentiable. f(x) is lower bounded by a real number f_low for any x ∈ R^n. ∇f is globally Lipschitz continuous with Lipschitz constant L; namely, for any x, y ∈ R^n,

 ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.
• AS.2: For any iteration k, we have

 a) E_{ξ_k}[g(x_k, ξ_k)] = ∇f(x_k),   (2.1)
 b) E_{ξ_k}[‖g(x_k, ξ_k) − ∇f(x_k)‖²] ≤ σ²,   (2.2)

where σ > 0 is the noise level of the gradient estimation, the samples ξ_k, k = 1, 2, …, are independent of each other, and for a given k the random variable ξ_k is independent of {x_j}_{j=1}^{k}.

###### Remark 2.1

Note that the stochastic BFGS methods studied in [38, 10, 39] require that the noisy gradient be bounded, i.e.,

 E_{ξ_k}[‖g(x_k, ξ_k)‖²] ≤ M_g,   (2.3)

where M_g > 0 is a constant. Our assumption (2.2) is weaker than (2.3).

Analogous to deterministic quasi-Newton methods, our SQN method takes steps

 x_{k+1} = x_k − α_k H_k g_k,   (2.4)

where g_k is defined as a mini-batch estimate of the gradient:

 g_k = (1/m_k) ∑_{i=1}^{m_k} g(x_k, ξ_{k,i}),   (2.5)

where m_k is the batch size and ξ_{k,i} denotes the random variable generated by the i-th sampling in the k-th iteration. From AS.2 we can see that g_k has the following properties:

 E[g_k | x_k] = ∇f(x_k),   E[‖g_k − ∇f(x_k)‖² | x_k] ≤ σ²/m_k.   (2.6)
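A toy illustration of the mini-batch estimate (2.5) and the variance bound in (2.6). The quadratic objective and Gaussian noise model below are illustrative assumptions, not part of the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def sfo(x, sigma=1.0):
    """Toy SFO: the true gradient of f(x) = 0.5*||x||^2 is x; the oracle
    returns it corrupted by zero-mean Gaussian noise of level sigma."""
    return x + sigma * rng.standard_normal(x.shape)

def minibatch_grad(x, m):
    """Mini-batch estimate (2.5): average of m i.i.d. SFO outputs.
    By (2.6), its variance is at most sigma^2 / m per the noise model."""
    return np.mean([sfo(x) for _ in range(m)], axis=0)
```

Averaging m independent oracle calls shrinks the noise variance by a factor of m, which is exactly the σ²/m_k term appearing in the bounds below.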
• AS.3: There exist two positive constants κ–– and ¯κ such that

 κ–– I ⪯ H_k ⪯ ¯κ I,  for all k,

where the notation A ⪯ B for symmetric matrices A and B means that B − A is positive semidefinite.

We denote by ξ_k := (ξ_{k,1}, …, ξ_{k,m_k}) the random samplings in the k-th iteration, and by ξ_{[k]} := (ξ_1, …, ξ_k) the random samplings in the first k iterations. Since H_k is generated iteratively by a random process based on historical gradient information, we make the following assumption on H_k to control the randomness (note that H_1 is given in the initialization step).

• AS.4: For any k ≥ 2, the random variable H_k depends only on ξ_{[k−1]}.

It then follows directly from AS.4 and (2.6) that

 E[H_k g_k | ξ_{[k−1]}] = H_k ∇f(x_k),   (2.7)

where the expectation is taken with respect to ξ_k generated in the computation of g_k.

We will not specify how to compute H_k until Section 3, where a specific updating scheme for H_k satisfying both assumptions AS.3 and AS.4 will be proposed.

We now present our SQN method for solving (1.1) as Algorithm 2.1.
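Algorithm 2.1 itself is not reproduced in this excerpt, but the framework of (2.4)-(2.5) can be sketched as follows. The quasi-Newton matrix H_k is abstracted behind a callback (`apply_H` is our placeholder name); when `apply_H` returns its argument unchanged, the loop reduces to mini-batch SGD:

```python
import numpy as np

def sqn(grad_oracle, x0, apply_H, alpha, m, iters, rng):
    """Generic SQN iteration (2.4): x_{k+1} = x_k - alpha_k * H_k g_k, where
    g_k is the mini-batch estimate (2.5).  apply_H(k, v) must return H_k @ v
    and is the only quasi-Newton-specific piece (e.g. an L-BFGS two-loop
    recursion satisfying AS.3-4)."""
    x = x0.copy()
    for k in range(1, iters + 1):
        # mini-batch stochastic gradient, (2.5)
        g = np.mean([grad_oracle(x, rng) for _ in range(m)], axis=0)
        x = x - alpha(k) * apply_H(k, g)
    return x
```

For instance, `apply_H = lambda k, v: v` with a diminishing stepsize `alpha = lambda k: 1.0 / k` recovers the SGD special case discussed in Remark 2.3.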

### 2.1 Convergence and complexity of SQN with diminishing step size

In this subsection, we analyze the convergence and complexity of SQN under the condition that the step size α_k in (2.4) is diminishing. Specifically, in this subsection we assume that α_k satisfies the following condition:

 ∑_{k=1}^{+∞} α_k = +∞,   ∑_{k=1}^{+∞} α_k² < +∞,   (2.8)

which is a standard assumption in stochastic approximation algorithms (see, e.g., [10, 39, 41]). One very simple choice of α_k that satisfies (2.8) is α_k = 1/k.

The following lemma shows that a descent property in terms of the expected objective value holds for SQN. Our analysis is similar to analyses that have been used in [6, 38].

###### Lemma 2.1

Suppose that {x_k} is generated by SQN and assumptions AS.1-4 hold. Further assume that (2.8) holds and that α_k ≤ κ––/(L¯κ²) for all k. (Note that this can be satisfied if α_k is non-increasing and the initial step size α_1 ≤ κ––/(L¯κ²).) Then the following inequality holds:

 E[f(x_{k+1}) | x_k] ≤ f(x_k) − (1/2) α_k κ–– ‖∇f(x_k)‖² + (L σ² ¯κ²)/(2 m_k) α_k²,  ∀k ≥ 1,   (2.9)

where the conditional expectation is taken with respect to ξ_k.

Define δ_k := g_k − ∇f(x_k). From (2.4) and assumptions AS.1 and AS.3, we have

 f(x_{k+1}) ≤ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2) ‖x_{k+1} − x_k‖²
 = f(x_k) − α_k ⟨∇f(x_k), H_k g_k⟩ + (L/2) α_k² ‖H_k g_k‖²
 ≤ f(x_k) − α_k ⟨∇f(x_k), H_k ∇f(x_k)⟩ − α_k ⟨∇f(x_k), H_k δ_k⟩ + (L/2) α_k² ¯κ² ‖g_k‖².   (2.10)

Taking the expectation with respect to ξ_k on both sides of (2.10) conditioned on x_k, we obtain

 E[f(x_{k+1}) | x_k] ≤ f(x_k) − α_k ⟨∇f(x_k), H_k ∇f(x_k)⟩ + (L/2) α_k² ¯κ² E[‖g_k‖² | x_k],   (2.11)

where we used (2.7) and the fact that E[⟨∇f(x_k), H_k δ_k⟩ | x_k] = 0. From (2.6) and E[δ_k | x_k] = 0, it follows that

 E[‖g_k‖² | x_k] = E[‖g_k − ∇f(x_k) + ∇f(x_k)‖² | x_k]
 = ‖∇f(x_k)‖² + E[‖g_k − ∇f(x_k)‖² | x_k] + 2 E[⟨δ_k, ∇f(x_k)⟩ | x_k]
 = ‖∇f(x_k)‖² + E[‖g_k − ∇f(x_k)‖² | x_k] ≤ ‖∇f(x_k)‖² + σ²/m_k,   (2.12)

which together with (2.11) and AS.3 yields that

 E[f(x_{k+1}) | x_k] ≤ f(x_k) − (α_k κ–– − (L/2) α_k² ¯κ²) ‖∇f(x_k)‖² + (L σ² ¯κ²)/(2 m_k) α_k².   (2.13)

Then (2.13) combined with the assumption α_k ≤ κ––/(L¯κ²) implies (2.9).

Before proceeding further, we introduce the definition of a supermartingale (see [16] for more details).

###### Definition 2.1

Let {F_k} be an increasing sequence of σ-algebras. If {X_k} is a stochastic process satisfying (i) E[|X_k|] < ∞; (ii) X_k is F_k-measurable for all k; and (iii) E[X_{k+1} | F_k] ≤ X_k for all k, then {X_k} is called a supermartingale.

###### Proposition 2.1 (see, e.g., Theorem 5.2.9 in [16])

If {X_k} is a nonnegative supermartingale, then lim_{k→∞} X_k exists almost surely and E[lim_{k→∞} X_k] ≤ E[X_1].

We are now ready to give convergence results for SQN (Algorithm 2.1).

###### Theorem 2.1

Suppose that assumptions AS.1-4 hold for {x_k} generated by SQN with batch size m_k = m for all k. If the stepsize α_k satisfies (2.8) and α_k ≤ κ––/(L¯κ²) for all k, then it holds that

 lim inf_{k→∞} ‖∇f(x_k)‖ = 0,  with probability 1.   (2.14)

Moreover, there exists a positive constant M_f such that

 E[f(x_k)] ≤ M_f,  ∀k.   (2.15)

Define γ_k := f(x_k) + (L σ² ¯κ²)/(2m) ∑_{i=k}^{+∞} α_i² and β_k := (κ––/2) α_k ‖∇f(x_k)‖². Let F_k be the σ-algebra measuring γ_k, β_k and {x_j}_{j=1}^{k}. From (2.9) we know that for any k ≥ 1, it holds that

 E[γ_{k+1} | F_k] = E[f(x_{k+1}) | F_k] + (L σ² ¯κ²)/(2m) ∑_{i=k+1}^{+∞} α_i² ≤ f(x_k) − (α_k κ––/2) ‖∇f(x_k)‖² + (L σ² ¯κ²)/(2m) ∑_{i=k}^{+∞} α_i² = γ_k − β_k,   (2.16)

which implies that E[γ_{k+1}] ≤ E[γ_k] ≤ ⋯ ≤ E[γ_1] < +∞. Since f(x_k) ≤ γ_k, we have E[f(x_k)] ≤ E[γ_1] =: M_f, which implies (2.15). According to Definition 2.1, {γ_k − f_low} is a nonnegative supermartingale. Therefore, Proposition 2.1 shows that there exists a γ such that lim_{k→∞} γ_k = γ with probability 1, and E[γ_k] ≤ E[γ_1]. Note that from (2.16) we have E[β_k] ≤ E[γ_k] − E[γ_{k+1}]. Thus,

 E[∑_{k=1}^{∞} β_k] ≤ ∑_{k=1}^{∞} (E[γ_k] − E[γ_{k+1}]) < +∞,

which further yields that

 ∑_{k=1}^{∞} β_k = (κ––/2) ∑_{k=1}^{∞} α_k ‖∇f(x_k)‖² < +∞  with probability 1.   (2.17)

Since ∑_{k=1}^{∞} α_k = +∞, it follows that (2.14) holds.

Under the assumption (2.3) used in [38, 10, 39], we now prove a stronger convergence result showing that any limit point of {x_k} generated by SQN is a stationary point of (1.1) with probability 1.

###### Theorem 2.2

Assume the same assumptions hold as in Theorem 2.1, and that (2.3) holds. Then

 limk→∞∥∇f(xk)∥=0, with probability 1. (2.18)

For any given ϵ > 0, according to (2.14), there exist infinitely many iterates x_k such that ‖∇f(x_k)‖ < ϵ. Then if (2.18) does not hold, there must exist two infinite sequences of indices {m_i}, {n_i} with m_i < n_i, such that for i = 1, 2, …,

 ‖∇f(x_{m_i})‖ ≥ 2ϵ,  ‖∇f(x_{n_i})‖ < ϵ,  ‖∇f(x_k)‖ ≥ ϵ,  k = m_i + 1, …, n_i − 1.   (2.19)

Then from (2.17) it follows that

 +∞ > ∑_{k=1}^{+∞} α_k ‖∇f(x_k)‖² ≥ ∑_{i=1}^{+∞} ∑_{k=m_i}^{n_i−1} α_k ‖∇f(x_k)‖² ≥ ϵ² ∑_{i=1}^{+∞} ∑_{k=m_i}^{n_i−1} α_k,  with probability 1,

which implies that

 ∑_{k=m_i}^{n_i−1} α_k → 0,  with probability 1,  as i → +∞.   (2.20)

According to (2.12), we have that

 E[‖x_{k+1} − x_k‖ | x_k] = α_k E[‖H_k g_k‖ | x_k] ≤ α_k ¯κ E[‖g_k‖ | x_k] ≤ α_k ¯κ (E[‖g_k‖² | x_k])^{1/2} ≤ α_k ¯κ (M_g/m)^{1/2},   (2.21)

where the second-to-last inequality follows from the convexity of t ↦ t² (Jensen's inequality), and the last inequality is due to (2.3). Then it follows from (2.21) that

 E[‖x_{n_i} − x_{m_i}‖] ≤ ¯κ (M_g/m)^{1/2} ∑_{k=m_i}^{n_i−1} α_k,

which together with (2.20) implies that ‖x_{n_i} − x_{m_i}‖ → 0 with probability 1 as i → +∞. Hence, from the Lipschitz continuity of ∇f, it follows that ‖∇f(x_{n_i}) − ∇f(x_{m_i})‖ → 0 with probability 1 as i → +∞. However, this contradicts (2.19). Therefore, the assumption that (2.18) does not hold cannot be true.

###### Remark 2.2

Note that our result in Theorem 2.2 is stronger than the ones given in existing works such as [38] and [39]. Moreover, although Bottou [6] also proved that the SA method for nonconvex stochastic optimization with diminishing stepsizes is almost surely convergent to stationary points, our analysis requires weaker assumptions. For example, [6] assumes that the objective function is three times continuously differentiable, while our analysis does not require this. Furthermore, we are able to analyze the iteration complexity of SQN for a specifically chosen step size (see Theorem 2.3 below), which is not provided in [6].

We now analyze the iteration complexity of SQN.

###### Theorem 2.3

Suppose that assumptions AS.1-4 hold for {x_k} generated by SQN with batch size m_k = m for all k. We also assume that α_k is specifically chosen as

 α_k = (κ––/(L ¯κ²)) k^{−β}   (2.22)

with β ∈ (1/2, 1). Note that this choice satisfies (2.8) and α_k ≤ κ––/(L¯κ²) for all k. Then

 (1/N) ∑_{k=1}^{N} E[‖∇f(x_k)‖²] ≤ (2L(M_f − f_low) ¯κ²)/(κ––²) N^{β−1} + (σ²)/((1−β) m) (N^{−β} − N^{−1}),   (2.23)

where N denotes the iteration number. Moreover, for a given ϵ ∈ (0, 1), to guarantee that (1/N) ∑_{k=1}^{N} E[‖∇f(x_k)‖²] < ϵ, the number of iterations N needed is at most O(ϵ^{−1/(1−β)}).

Taking expectations on both sides of (2.9) and summing over k = 1, …, N yields

 (κ––/2) ∑_{k=1}^{N} E[‖∇f(x_k)‖²]
 ≤ ∑_{k=1}^{N} (1/α_k) (E[f(x_k)] − E[f(x_{k+1})]) + (L σ² ¯κ²)/(2m) ∑_{k=1}^{N} α_k
 = (1/α_1) f(x_1) + ∑_{k=2}^{N} (1/α_k − 1/α_{k−1}) E[f(x_k)] − E[f(x_{N+1})]/α_N + (L σ² ¯κ²)/(2m) ∑_{k=1}^{N} α_k
 ≤ M_f/α_1 + M_f ∑_{k=2}^{N} (1/α_k − 1/α_{k−1}) − f_low/α_N + (L σ² ¯κ²)/(2m) ∑_{k=1}^{N} α_k
 = (M_f − f_low)/α_N + (L σ² ¯κ²)/(2m) ∑_{k=1}^{N} α_k
 ≤ (L(M_f − f_low) ¯κ²)/κ–– N^β + (σ² κ––)/(2(1−β) m) (N^{1−β} − 1),

which results in (2.23), where the second inequality is due to (2.15) and the last inequality is due to (2.22). Then for a given ϵ ∈ (0, 1), to guarantee that (1/N) ∑_{k=1}^{N} E[‖∇f(x_k)‖²] < ϵ, it suffices to require that

 (2L(M_f − f_low) ¯κ²)/(κ––²) N^{β−1} + (σ²)/((1−β) m) (N^{−β} − N^{−1}) < ϵ.

Since β ∈ (1/2, 1), the first term dominates, and it follows that the number of iterations needed is at most O(ϵ^{−1/(1−β)}).
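As a quick sanity check on the scaling, the iteration count implied by the dominant term of (2.23) grows as ϵ^{−1/(1−β)}; for example, β = 1/2 gives the familiar O(ϵ^{−2}) rate, and larger β (slower stepsize decay in (2.22)) makes the bound worse. A small illustrative computation (constants from (2.23) are dropped):

```python
import math

def iters_needed(eps, beta):
    """Smallest N (up to constants) such that the dominant term N^(beta-1)
    of (2.23) drops below eps, i.e. N = eps^(-1/(1-beta))."""
    return math.ceil(eps ** (-1.0 / (1.0 - beta)))
```

So tightening the tolerance by a factor of 10 multiplies the iteration bound by 10^{1/(1−β)}.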

###### Remark 2.3

Note that Theorem 2.3 also provides an iteration complexity analysis for the classical SGD method, which can be regarded as a special case of SQN with H_k = I. To the best of our knowledge, the complexity result in Theorem 2.3 is new for both SGD and stochastic quasi-Newton methods.

### 2.2 Complexity of SQN with random output and constant step size

We analyze the SFO-calls complexity of SQN when the output x_R is randomly chosen from {x_1, …, x_N}, where N is the maximum iteration number. Our results in this subsection are motivated by the randomized stochastic gradient (RSG) method proposed by Ghadimi and Lan [21]. RSG runs SGD for R iterations, where R is an integer randomly chosen from {1, …, N} according to a specifically defined probability mass function P_R. In [21] it is proved that under certain conditions on the step size α_k and P_R, O(ϵ^{-2}) SFO-calls are needed by SGD to guarantee E[‖∇f(x_R)‖²] ≤ ϵ. We show below that under similar conditions, the same complexity holds for our SQN.

###### Theorem 2.4

Suppose that assumptions AS.1-4 hold, and that α_k in SQN (Algorithm 2.1) is chosen such that 0 < α_k ≤ 2κ––/(L¯κ²) for all k, with α_k < 2κ––/(L¯κ²) for at least one k. Moreover, for a given integer N, let R be a random variable with the probability mass function

 P_R(k) := Prob{R = k} = (α_k κ–– − α_k² L ¯κ²/2) / (∑_{k=1}^{N} (α_k κ–– − α_k² L ¯κ²/2)),  k = 1, …, N.   (2.24)

Then we have

 E[‖∇f(x_R)‖²] ≤ (D_f + (σ² L ¯κ²/2) ∑_{k=1}^{N} (α_k²/m_k)) / (∑_{k=1}^{N} (α_k κ–– − α_k² L ¯κ²/2)),   (2.25)

where D_f := f(x_1) − f_low and the expectation is taken with respect to R and ξ_{[N]}. Moreover, if we choose α_k = κ––/(L¯κ²) and m_k = m for all k, then (2.25) reduces to

 E[‖∇f(x_R)‖²] ≤ (2L ¯κ² D_f)/(N κ––²) + σ²/m.   (2.26)

From (2.10) it follows that

 f(x_{k+1}) ≤ f(x_k) − α_k ⟨∇f(x_k), H_k ∇f(x_k)⟩ − α_k ⟨∇f(x_k), H_k δ_k⟩ + (L/2) α_k² ¯κ² [‖∇f(x_k)‖² + 2⟨∇f(x_k), δ_k⟩ + ‖δ_k‖²]
 ≤ f(x_k) − (α_k κ–– − (1/2) α_k² L ¯κ²) ‖∇f(x_k)‖² + (1/2) α_k² L ¯κ² ‖δ_k‖² + α_k² L ¯κ² ⟨∇f(x_k), δ_k⟩ − α_k ⟨∇f(x_k), H_k δ_k⟩,

where δ_k = g_k − ∇f(x_k). Now summing over k = 1, …, N and noticing that f(x_{N+1}) ≥ f_low yields

 ∑_{k=1}^{N} (α_k κ–– − (L ¯κ²/2) α_k²) ‖∇f(x_k)‖² ≤ f(x_1) − f_low + (L ¯κ²/2) ∑_{k=1}^{N} α_k² ‖δ_k‖² + ∑_{k=1}^{N} (α_k² L ¯κ² ⟨∇f(x_k), δ_k⟩ − α_k ⟨∇f(x_k), H_k δ_k⟩).   (2.27)

By AS.2 and AS.4 we have that

 E_{ξ_k}[⟨∇f(x_k), δ_k⟩ | ξ_{[k−1]}] = 0,  and  E_{ξ_k}[⟨∇f(x_k), H_k δ_k⟩ | ξ_{[k−1]}] = 0.

Moreover, from (2.6) it follows that E[‖δ_k‖² | ξ_{[k−1]}] ≤ σ²/m_k. Therefore, taking the expectation on both sides of (2.27) with respect to ξ_{[N]} yields

 ∑_{k=1}^{N} (α_k κ–– − α_k² L ¯κ²/2) E_{ξ_{[N]}}[‖∇f(x_k)‖²] ≤ f(x_1) − f_low + (L ¯κ² σ²/2) ∑_{k=1}^{N} α_k²/m_k.   (2.28)

It follows from the definition of P_R in (2.24) that

 E[‖∇f(x_R)‖²] = E_{R, ξ_{[N]}}[‖∇f(x_R)‖²] = (∑_{k=1}^{N} (α_k κ–– − α_k² L ¯κ²/2) E_{ξ_{[N]}}[‖∇f(x_k)‖²]) / (∑_{k=1}^{N} (α_k κ–– − α_k² L ¯κ²/2)),   (2.29)

which together with (2.28) implies (2.25).
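The random output rule (2.24) is straightforward to implement; a sketch with illustrative constants (the function and variable names are ours, not the paper's):

```python
import numpy as np

def sample_R(alphas, kappa_lo, kappa_bar, L, rng):
    """Draw the output index R from the probability mass function (2.24):
    P_R(k) is proportional to alpha_k*kappa_lo - alpha_k^2 * L * kappa_bar^2 / 2,
    which is positive whenever 0 < alpha_k < 2*kappa_lo/(L*kappa_bar^2)."""
    a = np.asarray(alphas, dtype=float)
    w = a * kappa_lo - 0.5 * a**2 * L * kappa_bar**2
    assert np.all(w >= 0) and w.sum() > 0
    return int(rng.choice(len(a), p=w / w.sum())) + 1  # 1-based iteration index
```

With the constant choice α_k = κ––/(L¯κ²) used in (2.26), the weights are equal and R is uniform on {1, …, N}.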

###### Remark 2.4

Note that in Theorem 2.4 the α_k's are not required to be diminishing; they can be constant as long as they are bounded above by 2κ––/(L¯κ²).

We now show that the SFO-calls complexity of SQN with random output and constant step size is O(ϵ^{-2}).

###### Corollary 2.5

Assume the conditions in Theorem 2.4 hold, and that α_k = κ––/(L¯κ²) and m_k = m for all k. Let N̄ be the total number of SFO-calls needed to calculate stochastic gradients in SQN (Algorithm 2.1). For a given accuracy tolerance ϵ, we assume that

 ¯N≥max{C21ϵ