Optimization Using Stochastic Quasi-Newton Methods
In this paper we study stochastic quasi-Newton methods for nonconvex stochastic optimization, where we assume that noisy information about the gradients of the objective function is available via a stochastic first-order oracle (SFO). We propose a general framework for such methods, for which we prove almost sure convergence to stationary points and analyze its worst-case iteration complexity. When a randomly chosen iterate is returned as the output of such an algorithm, we prove that in the worst-case, the SFO-calls complexity is O(ϵ^-2) to ensure that the expectation of the squared norm of the gradient is smaller than the given accuracy tolerance ϵ. We also propose a specific algorithm, namely a stochastic damped L-BFGS (SdLBFGS) method, that falls under the proposed framework. Moreover, we incorporate the SVRG variance reduction technique into the proposed SdLBFGS method, and analyze its SFO-calls complexity. Numerical results on a nonconvex binary classification problem using SVM, and a multiclass classification problem using neural networks are reported.
In this paper, we consider the following stochastic optimization problem:

    min_{x ∈ R^n} f(x) := E[F(x, ξ)],    (1.1)

where F: R^n × R^d → R is continuously differentiable and possibly nonconvex, ξ ∈ R^d denotes a random variable with distribution function P, and E[·] denotes the expectation taken with respect to ξ. In many cases the function F is not given explicitly and/or the distribution function P is unknown, or the function values and gradients of f cannot be easily obtained and only noisy information about the gradient of f is available. In this paper we assume that noisy gradients of f can be obtained via calls to a stochastic first-order oracle (SFO). Problem (1.1
) arises in many applications in statistics and machine learning [36, 52], mixed logit modeling problems in economics and transportation [7, 4, 26], as well as many other areas. A special case of (1.1) that arises frequently in machine learning is the empirical risk minimization problem

    min_{x ∈ R^n} f(x) := (1/n) Σ_{i=1}^n f_i(x),    (1.2)

where f_i(x) is the loss function corresponding to the i-th sample, and n denotes the number of samples and is assumed to be extremely large.
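To make the oracle model concrete, the following sketch (our own illustration; the least-squares loss and all variable names are assumptions, not taken from the paper) simulates an SFO for an empirical risk problem by sampling a single data index uniformly. Averaging many oracle calls recovers the full gradient, reflecting the oracle's unbiasedness:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                      # number of samples, dimension
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2."""
    return A.T @ (A @ x - b) / n

def sfo(x, idx):
    """Stochastic first-order oracle: gradient of one sampled loss f_idx."""
    return A[idx] * (A[idx] @ x - b[idx])

x = np.zeros(d)
# Averaging many oracle calls approximates the true gradient (unbiasedness).
est = np.mean([sfo(x, rng.integers(n)) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_grad(x)))
```

A single oracle call is n times cheaper than a full gradient, which is the practical motivation when n is extremely large.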
The idea of employing stochastic approximation (SA) to solve stochastic programming problems can be traced back to the seminal work of Robbins and Monro.
The classical SA method, also referred to as stochastic gradient descent (SGD), mimics the steepest descent method, i.e., it updates the iterate x_k via x_{k+1} = x_k − α_k g_k, where the stochastic gradient g_k is an unbiased estimate of the gradient ∇f(x_k) of f at x_k, and α_k > 0
is the step size. The SA method has been studied extensively in [12, 17, 19, 44, 45, 49, 50], where the main focus has been the convergence of SA in different settings. Recently, there has been a lot of interest in analyzing the worst-case complexity of SA methods, stimulated by the complexity theory developed by Nesterov for first-order methods for solving convex optimization problems [42, 43]. Nemirovski et al. proposed a mirror descent SA method for solving the convex stochastic programming problem min_{x∈X} f(x), where f is nonsmooth and convex and X ⊆ R^n is a convex set, and proved that for any given ϵ > 0, the method needs O(ϵ^-2) iterations to obtain an x̄ ∈ X such that E[f(x̄)] − min_{x∈X} f(x) ≤ ϵ. Other SA methods with provable complexities for solving convex stochastic optimization problems have also been studied in [20, 28, 29, 30, 31, 32, 3, 14, 2, 53, 56].
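A minimal SGD sketch under the classical diminishing step-size conditions (Σ α_k = ∞, Σ α_k² < ∞); the quadratic objective, the noise model, and all constants are our own toy assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x_star = rng.standard_normal(d)          # known minimizer of the toy objective

def stochastic_grad(x):
    # Unbiased estimate of the gradient of f(x) = 0.5*||x - x_star||^2.
    return (x - x_star) + 0.1 * rng.standard_normal(d)

x = np.zeros(d)
for k in range(1, 5001):
    alpha = 1.0 / k                      # satisfies sum = inf, sum of squares < inf
    x = x - alpha * stochastic_grad(x)   # SGD step: x_{k+1} = x_k - alpha_k * g_k

print(np.linalg.norm(x - x_star))
```

With α_k = 1/k the noise injected per step is summable in mean square, so the iterates settle near the minimizer despite never using exact gradients.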
Recently there has been a lot of interest in SA methods for the stochastic optimization problem (1.1) in which f is a nonconvex function. Bottou proposed an SA method to minimize a general cost function and proved that it converges to stationary points. Ghadimi and Lan proposed a randomized stochastic gradient (RSG) method that returns an iterate from a randomly chosen iteration as an approximate solution. It is shown that to return a solution x̄ such that E[‖∇f(x̄)‖²] ≤ ϵ, where ‖·‖ denotes the Euclidean norm, the total number of SFO-calls needed by RSG is O(ϵ^-2). Ghadimi and Lan also studied an accelerated SA method for solving (1.1) based on Nesterov's accelerated gradient method [42, 43], which improves the SFO-call complexity in the convex case. Ghadimi, Lan and Zhang proposed a mini-batch SA method for solving problems in which the objective function is a composition of a nonconvex smooth function and a convex nonsmooth function, and analyzed its worst-case SFO-call complexity. Dang and Lan proposed a method that incorporates a block-coordinate decomposition scheme into the stochastic mirror-descent methodology, for nonconvex stochastic optimization problems in which the convex feasible set has a block structure. More recently, Wang, Ma and Yuan proposed a penalty method for nonconvex stochastic optimization problems with nonconvex constraints, and analyzed its SFO-call complexity.
In this paper, we study stochastic quasi-Newton (SQN) methods for solving the nonconvex stochastic optimization problem (1.1). In the deterministic optimization setting, quasi-Newton methods are more robust and achieve higher accuracy than gradient methods, because they use approximate second-order derivative information. Quasi-Newton methods usually employ the following updates for solving (1.1):

    x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k),    (1.3)

where B_k is an approximation to the Hessian that is updated, for example, by the BFGS formula

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k),    (1.4)

where s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). By using the Sherman-Morrison-Woodbury formula, it is easy to derive that the equivalent update to H_k := B_k^{-1} is

    H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,    (1.5)

where ρ_k = 1/(s_k^T y_k). For stochastic optimization, there has been some work on designing stochastic quasi-Newton methods that update the iterates via (1.3) using a stochastic gradient in place of ∇f(x_k). Specific examples include the following. The adaptive subgradient (AdaGrad) method proposed by Duchi, Hazan and Singer, which takes B_k to be a diagonal matrix that estimates the diagonal of the square root of the uncentered covariance matrix of the gradients, has been proven to be quite efficient in practice. Bordes, Bottou and Gallinari studied SGD with a diagonal rescaling matrix based on the secant condition associated with quasi-Newton methods. Roux and Fitzgibbon discussed the necessity of including both Hessian and covariance matrix information in a stochastic Newton-type method. Byrd et al. proposed a quasi-Newton method that uses the sample average approximation (SAA) approach to estimate Hessian-vector products. In subsequent work, Byrd et al. proposed a stochastic limited-memory BFGS (L-BFGS) method based on SA, and proved its convergence for strongly convex problems. Stochastic BFGS and L-BFGS methods were also studied for online convex optimization by Schraudolph, Yu and Günter. For strongly convex problems, Mokhtari and Ribeiro proposed a regularized stochastic BFGS method (RES) and analyzed its convergence, and later studied an online L-BFGS method. Recently, Moritz, Nishihara and Jordan proposed a linearly convergent method that integrates an L-BFGS method with the variance reduction technique (SVRG) proposed by Johnson and Zhang to alleviate the effect of noisy gradients. A related method that incorporates SVRG into a quasi-Newton method was studied by Lucchi, McWilliams and Hofmann. Gower, Goldfarb and Richtárik proposed a variance-reduced block L-BFGS method that converges linearly for convex functions. It should be noted that all of the above stochastic quasi-Newton methods are designed for solving convex or even strongly convex problems.
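The inverse-Hessian form of the BFGS update obtained via the Sherman-Morrison-Woodbury formula can be sketched directly. This is the generic deterministic update, not the paper's stochastic variant, and the quadratic test function below is our own assumption:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation:
       H+ = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T,  rho = 1/(s^T y)."""
    rho = 1.0 / (s @ y)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Sanity check on a quadratic f(x) = 0.5*x^T Q x: when s^T y > 0 the update
# preserves symmetry and positive-definiteness, and H+ satisfies the secant
# equation H+ y = s.
rng = np.random.default_rng(2)
Q = np.diag([1.0, 2.0, 5.0])
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)
s, y = x1 - x0, Q @ (x1 - x0)          # y = grad(x1) - grad(x0) = Q s
H1 = bfgs_inverse_update(np.eye(3), s, y)
print(np.allclose(H1 @ y, s))          # secant equation holds
```

Note that V^T y = 0 by construction, so H+ y = ρ s (s^T y) = s exactly; this is the mechanism behind the secant property.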
Challenges. The key challenge in designing stochastic quasi-Newton methods for nonconvex problems lies in the difficulty of preserving the positive-definiteness of B_k (and H_k), due to the nonconvexity of the problem and the presence of noise in the gradient estimates. It is known that the BFGS update (1.4) preserves positive-definiteness as long as the curvature condition

    s_k^T y_k > 0    (1.6)

holds, which is guaranteed for strongly convex problems. For nonconvex problems, the curvature condition (1.6) can be satisfied by performing a line search. However, a line search is no longer feasible for (1.1) in the stochastic setting, because exact function values and gradient information are not available. As a result, an important issue in designing stochastic quasi-Newton methods for nonconvex problems is how to preserve the positive-definiteness of B_k (and H_k) without resorting to a line search.
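One common way to enforce the curvature condition without a line search is damping, which replaces y_k by a convex combination of y_k and B_k s_k. The sketch below uses a Powell-style damping rule with threshold 0.2 as an illustration; this is a standard textbook rule and may differ from the specific rule used in SdLBFGS:

```python
import numpy as np

def damped_y(s, y, B, delta=0.2):
    """Powell-style damping: replace y by a convex combination of y and B s
    so that the curvature satisfies s^T y_bar >= delta * s^T B s > 0."""
    sBs = s @ B @ s
    sy = s @ y
    if sy >= delta * sBs:
        theta = 1.0                                  # enough curvature: keep y
    else:
        theta = (1.0 - delta) * sBs / (sBs - sy)     # shrink toward B s
    return theta * y + (1.0 - theta) * (B @ s)

# Even when raw curvature is negative (possible for nonconvex f or noisy
# gradients), the damped pair has strictly positive curvature.
s = np.array([1.0, 0.0])
y = np.array([-1.0, 0.5])            # s^T y = -1 < 0
B = np.eye(2)
y_bar = damped_y(s, y, B)
print(s @ y_bar > 0)
```

Using the damped pair (s, ȳ) in the BFGS update then keeps the approximation positive definite regardless of the sign of the raw curvature s^T y.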
Our contributions. The contributions of this paper, and where they appear, are as follows.
We propose a general framework for stochastic quasi-Newton (SQN) methods for solving the nonconvex stochastic optimization problem (1.1), and prove its almost sure convergence to a stationary point when the step size is diminishing. We also prove that the number of iterations needed to obtain min_k E[‖∇f(x_k)‖²] ≤ ϵ is O(ϵ^{-1/(1-β)}) for step size α_k chosen proportional to k^{-β}, where β ∈ (1/2, 1) is a constant. (See Section 2)
When a randomly chosen iterate is returned as the output of SQN, we prove that the worst-case SFO-calls complexity is O(ϵ^-2) to guarantee E[‖∇f(x_R)‖²] ≤ ϵ. (See Section 2.2)
We propose a stochastic damped L-BFGS (SdLBFGS) method that falls under the proposed framework. This method adaptively generates a positive definite matrix H_k that approximates the inverse Hessian at the current iterate x_k. Convergence and complexity results for this method are provided. Moreover, our method does not generate H_k explicitly; only its products with vectors are computed directly. (See Section 3)
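The framework's basic iteration, x_{k+1} = x_k − α_k H_k g_k, can be sketched generically, with H_k entering only through matrix-vector products. The toy oracle, the identity choice of H_k, and all parameter values below are our own placeholders, not the paper's algorithm:

```python
import numpy as np

def sqn(sfo, x0, H_mult, alphas, batch_sizes, rng):
    """Generic SQN-style iteration: x_{k+1} = x_k - alpha_k * H_k g_k, where
    g_k is a mini-batch stochastic gradient and H_k enters only through a
    matrix-vector product routine H_mult(k, v)."""
    x = x0.copy()
    for k, (alpha, m) in enumerate(zip(alphas, batch_sizes)):
        g = np.mean([sfo(x, rng) for _ in range(m)], axis=0)   # mini-batch gradient
        x = x - alpha * H_mult(k, g)                           # quasi-Newton step
    return x

# Toy run on f(x) = 0.5*||x||^2 with noisy gradients and H_k = I.
rng = np.random.default_rng(3)
sfo = lambda x, rng: x + 0.1 * rng.standard_normal(x.size)
x = sqn(sfo, np.ones(4), lambda k, g: g,
        alphas=[0.5] * 200, batch_sizes=[4] * 200, rng=rng)
print(np.linalg.norm(x) < 0.2)
```

Passing H_k as a matrix-vector product routine mirrors the design point above: a limited-memory scheme never needs the matrix itself.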
In this section, we study SQN methods for the (possibly nonconvex) stochastic optimization problem (1.1). We assume that an SFO outputs a stochastic gradient g(x, ξ) of f for a given x, where ξ is a random variable whose distribution is supported on a set Ξ ⊆ R^d. Here we assume that Ξ does not depend on x.
We now give some assumptions that are required throughout this paper.
AS.1. f: R^n → R is continuously differentiable; f(x) is lower bounded by a real number f^low for any x ∈ R^n; and ∇f is globally Lipschitz continuous with Lipschitz constant L, namely, for any x, y ∈ R^n,

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.    (2.1)
AS.2. For any iteration k, we have

    E_{ξ_k}[g(x_k, ξ_k)] = ∇f(x_k),    (2.2)
    E_{ξ_k}[‖g(x_k, ξ_k) − ∇f(x_k)‖²] ≤ σ²,    (2.3)

where σ > 0 is the noise level of the gradient estimation, the samples ξ_k, k = 1, 2, …, are independent, and for a given k the random variable ξ_k is independent of x_k.
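The unbiasedness and bounded-variance assumptions, and the variance reduction obtained by averaging m oracle calls, can be checked numerically on a synthetic oracle (all values below are illustrative assumptions; the mean squared error of an m-call average scales like 1/m):

```python
import numpy as np

rng = np.random.default_rng(4)
d, sigma = 3, 0.5
grad = np.ones(d)                              # true gradient at a fixed point

def oracle():
    # Unbiased noisy gradient: E = grad, E||error||^2 = d*sigma^2 (bounded).
    return grad + sigma * rng.standard_normal(d)

def batch_mse(m, trials=4000):
    """Monte Carlo estimate of E||mean of m oracle calls - grad||^2."""
    errs = [np.mean([oracle() for _ in range(m)], axis=0) - grad
            for _ in range(trials)]
    return np.mean([e @ e for e in errs])

# Averaging 4 calls should cut the mean squared error by about a factor of 4.
print(batch_mse(1) / batch_mse(4))
```

This 1/m scaling is exactly what makes mini-batching (introduced below) useful for controlling gradient noise.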
Analogous to deterministic quasi-Newton methods, our SQN method takes steps

    x_{k+1} = x_k − α_k H_k g_k,    (2.4)

where α_k > 0 is the step size, H_k is a positive definite matrix, and g_k is defined as a mini-batch estimate of the gradient:

    g_k = (1/m_k) Σ_{i=1}^{m_k} g(x_k, ξ_{k,i}),    (2.5)

where m_k is the batch size and ξ_{k,i} denotes the random variable generated by the i-th sampling in the k-th iteration. From AS.2 we can see that g_k has the following properties:

    E[g_k] = ∇f(x_k),    E[‖g_k − ∇f(x_k)‖²] ≤ σ²/m_k.    (2.6)
AS.3. There exist two positive constants C_l and C_u such that

    C_l I ⪯ H_k ⪯ C_u I for all k,

where the notation A ⪰ B, with A, B ∈ R^{n×n} symmetric, means that A − B is positive semidefinite.
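Checking a sandwich condition of this kind numerically amounts to checking that the extreme eigenvalues of the symmetric matrix lie between the two constants (called C_l and C_u here); a small sketch, with an arbitrary example matrix and bounds of our choosing:

```python
import numpy as np

def sandwiched(H, c_l, c_u):
    """Check c_l*I <= H <= c_u*I in the semidefinite order, i.e. that
    H - c_l*I and c_u*I - H are both positive semidefinite."""
    eig = np.linalg.eigvalsh(H)          # eigenvalues of symmetric H
    return eig.min() >= c_l and eig.max() <= c_u

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
print(sandwiched(H, 0.5, 3.0))
```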
We denote by ξ_k := (ξ_{k,1}, …, ξ_{k,m_k}) the random samplings in the k-th iteration, and by ξ_{[k]} := (ξ_1, …, ξ_k) the random samplings in the first k iterations. Since H_k is generated iteratively based on historical gradient information by a random process, we make the following assumption on H_k to control the randomness (note that H_1 is given in the initialization step).

AS.4. For any k ≥ 2, the random variable H_k depends only on ξ_{[k-1]}.

It then follows directly from AS.4 and (2.6) that

    E_{ξ_k}[H_k g_k] = H_k ∇f(x_k),    (2.7)

where the expectation is taken with respect to ξ_k generated in the computation of g_k.
We will not specify how to compute H_k until Section 3, where a specific updating scheme for H_k satisfying both assumptions AS.3 and AS.4 will be proposed.
In this subsection, we analyze the convergence and complexity of SQN under the condition that the step size α_k in (2.4) is diminishing. Specifically, in this subsection we assume that α_k satisfies the following condition:

    Σ_{k=1}^∞ α_k = ∞,    Σ_{k=1}^∞ α_k² < ∞.    (2.8)
Suppose that {x_k} is generated by SQN and assumptions AS.1-4 hold. Further assume that (2.8) holds, and that α_k ≤ C_l/(L C_u²) for all k. (Note that this can be satisfied if α_k is non-increasing and the initial step size α_1 ≤ C_l/(L C_u²).) Then the following inequality holds:

    E_{ξ_k}[f(x_{k+1})] ≤ f(x_k) − (α_k C_l/2) ‖∇f(x_k)‖² + (α_k² L C_u² σ²)/(2 m_k),    (2.9)

where the conditional expectation is taken with respect to ξ_k.
Define δ_k := g_k − ∇f(x_k). From (2.4) and assumptions AS.1 and AS.3, we have
Taking expectation with respect to ξ_k on both sides of (2.10), conditioned on ξ_{[k-1]}, we obtain,
which together with (2.11) and AS.3 yields that
Before proceeding further, we introduce the definition of a supermartingale (see  for more details).
Definition 2.1. Let {F_k} be an increasing sequence of σ-algebras. If {X_k} is a stochastic process satisfying (i) E[|X_k|] < ∞; (ii) X_k ∈ F_k for all k; and (iii) E[X_{k+1} | F_k] ≤ X_k for all k, then {X_k} is called a supermartingale.
Proposition 2.1. If {X_k} is a nonnegative supermartingale, then X_k converges to a limit X almost surely and E[X] ≤ E[X_1].
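A quick numerical illustration of the proposition (our own toy process, not part of the analysis): the nonnegative process X_{k+1} = X_k · exp(Z_k − 0.6) with Z_k ~ N(0, 1) satisfies E[X_{k+1} | X_k] = X_k · e^{-0.1} ≤ X_k, so it is a nonnegative supermartingale and converges almost surely (here the limit is 0):

```python
import numpy as np

rng = np.random.default_rng(6)

# X_{k+1} = X_k * exp(Z_k - 0.6), Z_k ~ N(0,1).  Since E[exp(Z - 0.6)]
# = exp(0.5 - 0.6) < 1, this is a nonnegative supermartingale; by the
# supermartingale convergence theorem it converges almost surely.
x = 1.0
for _ in range(2000):
    x *= np.exp(rng.standard_normal() - 0.6)
print(x)
```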
We are now ready to give convergence results for SQN (Algorithm 2.1).
Suppose that assumptions AS.1-4 hold for {x_k} generated by SQN with batch size m_k = m for all k. If the step size α_k satisfies (2.8) and α_k ≤ C_l/(L C_u²) for all k, then it holds that

    lim_{k→∞} ‖∇f(x_k)‖ = 0 with probability 1.

Moreover, there exists a positive constant M_f such that

    E[f(x_k)] ≤ M_f for all k.
Define and . Let be the -algebra measuring , and . From (2.9) we know that for any , it holds that
which implies that Since , we have , which implies (2.15). According to Definition 2.1, is a supermartingale. Therefore, Proposition 2.1 shows that there exists a such that with probability 1, and . Note that from (2.16) we have . Thus,
which further yields that
Since , it follows that (2.14) holds.
Then from (2.17) it follows that
which implies that
According to (2.12), we have that
which together with (2.20) implies that with probability 1, as . Hence, from the Lipschitz continuity of ∇f, it follows that with probability 1 as . However, this contradicts (2.19). Therefore, (2.18) must hold.
Note that our result in Theorem 2.2 is stronger than those given in existing works. Moreover, although Bottou also proved that the SA method for nonconvex stochastic optimization with diminishing step size converges almost surely to a stationary point, our analysis requires weaker assumptions. For example, that analysis assumes that the objective function is three times continuously differentiable, whereas ours does not. Furthermore, we are able to analyze the iteration complexity of SQN for a specifically chosen step size (see Theorem 2.3 below), which is not provided there.
We now analyze the iteration complexity of SQN.
Suppose that assumptions AS.1-4 hold for {x_k} generated by SQN with batch size m_k = m for all k. We also assume that α_k is specifically chosen as

    α_k = (C_l/(L C_u²)) k^{-β},

with β ∈ (1/2, 1). Note that this choice satisfies (2.8) and α_k ≤ C_l/(L C_u²) for all k. Then

    min_{k=1,…,N} E[‖∇f(x_k)‖²] = O(N^{β-1}),

where N denotes the iteration number. Moreover, for a given ϵ ∈ (0, 1), to guarantee that min_{k=1,…,N} E[‖∇f(x_k)‖²] < ϵ, the number of iterations N needed is at most O(ϵ^{-1/(1-β)}).
Taking expectations on both sides of (2.9) and summing over k = 1, …, N yields
Since β ∈ (1/2, 1), it follows that the number of iterations needed is at most O(ϵ^{-1/(1-β)}).
We analyze the SFO-calls complexity of SQN when the output is randomly chosen from {x_1, …, x_N}, where N is the maximum iteration number. Our results in this subsection are motivated by the randomized stochastic gradient (RSG) method proposed by Ghadimi and Lan. RSG runs SGD for R iterations, where R is an integer randomly chosen from {1, …, N} with a specifically defined probability mass function P_R. It is proved that, under certain conditions on the step size and P_R, O(ϵ^-2) SFO-calls are needed by SGD to guarantee E[‖∇f(x_R)‖²] ≤ ϵ. We show below that under similar conditions, the same complexity holds for our SQN.
Suppose that assumptions AS.1-4 hold, and that the step sizes α_k in SQN (Algorithm 2.1) are chosen such that α_k ≤ 2C_l/(L C_u²) for all k, with α_k < 2C_l/(L C_u²) for at least one k. Moreover, for a given integer N, let R be a random variable with the probability mass function

    P_R(k) := (C_l α_k − (L C_u²/2) α_k²) / Σ_{j=1}^N (C_l α_j − (L C_u²/2) α_j²),  k = 1, …, N.    (2.24)

Then we have

    E[‖∇f(x_R)‖²] ≤ (D_f + (L C_u² σ²/2) Σ_{j=1}^N α_j²/m_j) / Σ_{j=1}^N (C_l α_j − (L C_u²/2) α_j²),    (2.25)

where D_f := f(x_1) − f^low and the expectation is taken with respect to R and ξ_{[N]}. Moreover, if we choose α_k = C_l/(L C_u²) and m_k = m for all k, then (2.25) reduces to

    E[‖∇f(x_R)‖²] ≤ (2 L C_u² D_f)/(C_l² N) + σ²/m.    (2.26)
From (2.10) it follows that
where . Now summing and noticing that , yields
By AS.2 and AS.4 we have that
It follows from the definition of P_R in (2.24) that
Note that in Theorem 2.4 the step sizes α_k are not required to be diminishing; they can be constant, as long as they satisfy the upper bound stated in the theorem.
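The random-output mechanism can be sketched as follows: run the iteration with a constant step size, store the iterates, and return one drawn at random. We use a uniform draw as a simplification (with constant step sizes, a pmf of the form (2.24) places equal mass on every iteration), and the toy oracle and constants below are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, alpha = 4, 50, 0.3

def sfo(x):
    # Noisy gradient of f(x) = 0.5*||x||^2.
    return x + 0.1 * rng.standard_normal(d)

# Run the iteration with a constant step size, storing all iterates,
# then return a randomly chosen one as the algorithm's output.
xs = [np.ones(d)]
for _ in range(N):
    xs.append(xs[-1] - alpha * sfo(xs[-1]))
R = rng.integers(1, N + 1)                     # random output index in {1,...,N}
x_out = xs[R]
print(x_out.shape)
```

Returning a randomly chosen iterate is what converts the averaged gradient bound over all iterations into a guarantee on the expected squared gradient norm of the single output.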
We now show that the SFO-calls complexity of SQN with random output and constant step size is O(ϵ^-2).