# Stochastic L-BFGS: Improved Convergence Rates and Practical Acceleration Strategies

We revisit the stochastic limited-memory BFGS (L-BFGS) algorithm. By proposing a new framework for the convergence analysis, we prove improved convergence rates and computational complexities of the stochastic L-BFGS algorithms compared to previous works. In addition, we propose several practical acceleration strategies to speed up the empirical performance of such algorithms. We also provide theoretical analyses for most of the strategies. Experiments on large-scale logistic and ridge regression problems demonstrate that our proposed strategies yield significant improvements vis-à-vis competing state-of-the-art algorithms.

## Authors

• 6 publications
• 8 publications
• 77 publications
06/28/2018

### A Simple Stochastic Variance Reduced Algorithm with Fast Convergence Rates

Recent years have witnessed exciting progress in the study of stochastic...
01/07/2021

### Accelerated, Optimal, and Parallel: Some Results on Model-Based Stochastic Optimization

We extend the Approximate-Proximal Point (aProx) family of model-based m...
10/29/2020

### Convergence of Constrained Anderson Acceleration

We prove non asymptotic linear convergence rates for the constrained And...
06/10/2021

### A Continuized View on Nesterov Acceleration for Stochastic Gradient Descent and Randomized Gossip

We introduce the continuized Nesterov acceleration, a close variant of N...
05/16/2021

### Analysis of target data-dependent greedy kernel algorithms: Convergence rates for f-, f · P- and f/P-greedy

Data-dependent greedy algorithms in kernel spaces are known to provide f...
12/08/2019

### Histogram Transform Ensembles for Large-scale Regression

We propose a novel algorithm for large-scale regression problems named h...
11/24/2015

### Performance Limits of Stochastic Sub-Gradient Learning, Part I: Single Agent Case

In this work and the supporting Part II, we examine the performance of s...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

We are interested in the following (unconstrained) convex finite-sum minimization problem

 minx∈Rd[f(x)≜1nn∑i=1fi(x)], (1)

where and

denote the ambient dimension of the decision vector and the number of component functions respectively. Problems in the form of (

1

) play important roles in machine learning and signal processing. One important class of such problems is the

empirical risk minimization (ERM) problem, where each assumes the form

 fi(x)≜ℓ(aTix,bi)+λR(x). (2)

In (2),

denotes a smooth loss function,

a smooth convex regularizer (e.g., Tikhonov), the regularization weight and the set of feature-response pairs. Depending on the form of and

, many important machine learning problems—such as logistic regression, ridge regression and soft-margin support vector machines—are special cases of ERM.

We focus on the case where both and are large, and is ill-conditioned (i.e., the condition number of is large).111In this work, the condition number of a (strongly) convex function refers to that of its Hessian. In the context of ERM, this means the dataset that defines (1) is large and the feature vectors have high ambient dimension. However, the points typically belong to a low-dimensional manifold. Such a setting is particularly relevant in the big-data era, due to unprecedented data acquisition abilities.

When is large, the computational costs incurred by the batch optimization methods (both first- and second-order) are prohibitive, since in such methods the gradients of all the component functions need to be computed at each iteration. Therefore, stochastic (randomized) optimization methods have become very popular. At each iteration, only a subset of component functions, rather than all of them, are processed. In this way, for a given time budget, much more progress can be made towards global optima compared to a single-step taken for batch methods. When is large, Newton- or quasi-Newton-based methods  (both batch and stochastic) incur both high computational and storage complexities. Consequently only first-order and limited-memory quasi-Newton methods (e.g., L-BFGS [2]) are practical in this setting.

### I-a Related Works

When both and are large, as in our setting, most of research efforts have been devoted to stochastic first-order methods

, which include stochastic gradient descent (SGD)

[3, 4]

and its variance-reduced modifications

[5, 6, 7, 8]. However, these methods do not make use of the curvature information. This limits their abilities to find highly accurate solutions for ill-conditioned problems. In order to incorporate the curvature information in the limited-memory setting, recently much progress have been made toward developing stochastic L-BFGS algorithm. A partial list of such works includes [9, 10, 11, 12, 13, 14, 15, 16, 17]. In particular, the first convergent algorithm was proposed by Mokhtari and Ribeiro [12]. Later, the algorithm in [15] makes use of the subsampled Hessian-vector products to form the correction pairs (as opposed to using difference of stochastic gradients) and achieves better results than previous methods. However, the convergence rate is sublinear (in the strongly-convex case), similar to that of SGD. Later, Moritz et al[16] combines this method with stochastic variance-reduced gradient (SVRG) and proves linear convergence of the resulting algorithm. The algorithm in [17] maintains the structure of this algorithm but incorporates the block BFGS update to collect more curvature information in the optimization process. Although the convergence rate of this new method is similar to that in [16], experimental results demonstrate practical speedups introduced by the block BFGS update. Finally, there also exist a large volume of works on decentralized second-order methods [18, 19, 20, 21, 22, 23, 24] that aim to coordinate multiple distributed agents (with computational and storage abilities) in the optimization task. Since we are not concerned with decentralized optimization algorithms in this paper, we do not discuss these works in details here.

### I-B Motivations and Main Contributions

Our work can be motivated from both theory and practice. In terms of theory, although linear convergence (in expectation) has been shown for both algorithms in [16] and [17], the convergence rates (and hence computational complexities) therein can be potentially further improved. (The analysis method in [17] mainly follows that in [16]

, so we treat the analyses in both works in a unified manner.) In addition, these results may be strengthened in a probabilistic sense, e.g., from convergence in expectation to convergence in probability or almost surely. In terms of practice, in addition to block BFGS update, there may exist several other practical strategies that can potentially further accelerate

222In this work, we refer “acceleration” to general strategies that speed up the algorithm, not necessarily the ones based on momentum methods. the algorithm in [16]. Based on these two aspects, our work consists of the following main contributions.

1) We propose a coordinate transformation framework to analyze stochastic L-BFGS-type algorithms in [16] and [17]. Our analysis framework yields a much improved (linear) convergence rate (both in expectation and almost surely) and computational complexity. The essential idea of our method is to unify the analysis of stochastic first-order and second-order methods; as a result, it opens new avenues of designing and analyzing other variants of stochastic second-order algorithms based on their first-order counterparts.

2) We conduct a computational complexity analysis for the stochastic L-BFGS algorithms, which is the first of its kind.

3) We propose several practical acceleration strategies to speed up the convergence of the stochastic L-BFGS algorithm in [16]. We demonstrate the efficacy of our strategies through numerical experiments on logistic and ridge regression problems. We also prove linear convergence for most of these strategies.

## Ii Preliminaries

### Ii-a Notations

We use lowercase, bold lowercase and bold uppercase letters to denote scalars, vectors and matrices respectively. For a matrix , we denotes its -th entry as . For a function , define the function as the composition , for any . A continuously differentiable function is -smooth () if and only if is -Lipschitz on . We use to denote the set of natural numbers. For any , we define and . Accordingly, for a sequence of sets , define , for any . As usual, and . For a set , denote its complement as . For any sequence , we define if . We use to denote both the norm of a vector and the spectral norm of a matrix. We use and (with subscripts and superscripts) to denote the approximate Hessian and approximate inverse Hessian in L-BFGS algorithms respectively, following the convention in [25]. is also known as the metric matrix [26]. In this work, technical lemmas (whose indices begin with ‘T’) will appear in Appendix E.

### Ii-B Assumptions on Component Functions fi

###### Assumption 1.

For each , is convex and twice differentiable on . For ERM problems (2), we assume these two properties are satisfied by the loss function in its first argument on and by the regularizer on .

###### Assumption 2.

For each , is -strongly convex and -smooth on , where .

###### Remark 1.

Assumptions 1 and 2 are standard in the analysis of both deterministic and stochastic second-order optimization methods. The strong convexity of in Assumption 2 ensures positive curvature at any point in , which in turn guarantees the well-definedness of the BFGS update. As a common practice in the literature [15, 16], this condition can typically be enforced by adding a strongly convex regularizer (e.g., Tikhonov) to . Due to the strong convexity, (1) has a unique solution, denoted as .

## Iii Algorithm

We will provide a refined analysis of the optimization algorithm (with some modifications) suggested in [16] and so we recapitulate it in Algorithm 1. This algorithm can be regarded as a judicious combination of SVRG and L-BFGS algorithms. We use and to denote the outer and inner iteration indices respectively and to denote the index of metric matrices . We also use and to denote an inner iterate and outer iterate respectively.

Each outer iteration consists of inner iterations. Before the inner iterations, we first compute a full gradient . In each inner iteration , the only modification that we make with respect to (w.r.t.) the original algorithm in [16] is that in computing the stochastic gradient , the index set of size is sampled with replacement nonuniformly [27, 28]. Specifically, the elements in are sampled i.i.d. from a discrete distribution , such that for any , . As will be seen in Lemma 3, compared to uniform sampling, nonuniform sampling leads to a better variance bound on the stochastic gradient . Using and , we then compute according to (7) in Algorithm 1, where

 ∇fBs,t(x)≜1b∑i∈Bs,t1npi∇fi(x),∀x∈Rd. (3)

This specific way to construct reduces the variance of to zero as (see Lemma 3), and serves as a crucial step in the SVRG framework.

Then we compute the search direction . The metric matrix serves as an approximate of the inverse Hessian matrix and therefore contains the local curvature information at the recent iterates. Consequently, may act as a better descent direction than . Since storing may incur high storage cost (indeed, space) for large , (stochastic) L-BFGS-type methods compute each time from a set of recent correction pairs (that only occupies space) and . In this way, the limitation on memory can be overcome.

Denote as the memory parameter. We next describe the construction of the set of recent correction pairs , where . Together with , this set will be used to compute the matrix-vector product . Before doing so, in line 13, we first compute the averaged past iterates from for every inner iterates, where . Based on and , we compute the most recent correction pair in line 16. Following [15], in computing , we first sample an index set of size uniformly without replacement, and then let , where

 ∇2fTr(¯¯¯xr)≜1bH∑i∈Tr∇2fi(¯¯¯xr), (4)

denotes the sub-sampled Hessian at . Finally, we update to by inserting into and deleting from it.

Based on , a direct approach to compute would be computing first and then forming the product with . Computing involves applying BFGS updates to

 H(r−M′)r≜sTryr∥yr∥2I (5)

using . For each , the update is

 H(k)r=(I−ykskTykTsk)H(k−1)r(I−skykTykTsk)+skskTykTsk. (6)

Finally we set . Instead of using this direct approach, we adopt the two-loop recursion algorithm [25, Algorithm 7.4] to compute as a whole. This method serves the same purpose as the direct one, but, as we shall see, with much reduced computation.

At the end of each outer iteration , the starting point of next outer iteration, is either uniformly sampled (option I) or averaged (option II) from all the past inner iterates . As shown in Theorem 1, these two options can be analyzed in a unified manner.

###### Remark 2.

Under many scenarios (e.g., the ridge and logistic regression problems in Section VII), the smoothness parameters

can be accurately estimated. (For ERM problems, these parameters are typically data-dependent.) If in some cases, accurate estimates of these parameters are not available, we can simply employ uniform sampling, which is a special case of our weighted sampling technique.

## Iv Convergence Analysis

### Iv-a Definitions

Let and be permutations of such that and . Given any , define

 ¯¯¯μ˜n≜1˜n˜n∑j=1μi–jand¯L˜n≜1˜nn∑j=n−˜n+1L¯ij. (8)

Accordingly, define

 κmax≜Lmaxμminandκ˜n≜¯¯¯¯L˜n¯¯¯μ˜n. (9)

In particular, define , and . Denote the probability space on which the sequence of (random) iterates in Algorithm 1 is defined as , where is the Borel -algebra of . We also define a filtration such that contains all the information up to the time . Formally, , where denotes the

-algebra generated by random variables

. Define .

To introduce our coordinate transformation framework, we define some transforms of variables appearing in Algorithm 1. Specifically, for any , define , , and .333As will be shown in Lemma 4, for any , for some . Therefore, (and ) are well-defined. We also define transformed functions and , for any and .

To state our convergence results, we define the notions of linear convergence and R-linear convergence.

###### Definition 1.

A sequence is said to converge to linearly (or more precisely, Q-linearly) with rate if

 limsupn→∞∥xn+1−¯¯¯x∥∥xn−¯¯¯x∥≤ι. (10)

We say R-linearly with rate if there exists a nonnegative sequence such that for sufficiently large and linearly with rate .

### Iv-B Preliminary Lemmas

From the definitions of transformed variables and functions in Section IV-A, we immediately have the following lemmas.

###### Lemma 1.

For any and , we have

 ˜fi,r(˜xs,t,r) =fi(xs,t), (11) ∇˜fi,r(˜xs,t,r) =H1/2r∇fi(xs,t), (12) ∇2˜fi,r(˜xs,t,r) =H1/2r∇2fi(xs,t)H1/2r. (13)
###### Lemma 2.

If there exist such that for all , then for any and , is twice differentiable, -strongly convex and -smooth on . Consequently, is twice differentiable, -strongly convex and -smooth on .

Next we derive two other lemmas that will not only be used in the analysis later, but have the potential to be applied to more general problem settings. Specifically, Lemma 3 can be applied to any stochastic optimization algorithms based on SVRG and Lemma 4 can be applied to any finite-sum minimization algorithms based on L-BFGS methods (not necessarily stochastic in nature). The proofs of Lemmas 3 and 4 are deferred to Appendices A and B respectively.

###### Lemma 3 (Variance bound of vs,t).

In Algorithm 1, we have and

 EBs,t[∥vs,t−∇f(xs,t)∥2|Fs,t] ≤4¯¯¯¯Lb(f(xs,t)−f(x∗)+f(xs)−f(x∗)). (14)
###### Remark 3.

In previous works [16, 17], a uniform mini-batch sampling of was employed, and different variance bounds of were derived. In [16, Lemma 6], the bound was

 4Lmax(f(xs,t)−f(x∗)+f(xs)−f(x∗)). (15)

In [17, Lemma 2], this bound was slightly improved to

 4Lmax((f(xs,t)−f(x∗)) (16)

However, both of these bounds fail to capture the dependence on the mini-batch size . In contrast, in this work we consider a nonuniform mini-batch sampling (with replacement). Due to division by and (indeed in many cases, ), our bound in (14) is superior to (15) and (16). As will be seen in Theorem 1, our better bound (14) leads to a faster (linear) convergence rate of Algorithm 1.

###### Lemma 4 (Uniform Spectral Bound of {Hr}r≥0).

The spectra of are uniformly bounded, i.e., for each , , where444We assume for any since we focus on the setting where is ill-conditioned, i.e., . If for some , then and remains the same. The proof for this case can be straightforwardly adapted from that in Section B.

 γ≜1(M+1)¯¯¯¯LbHandΓ≜κM+1bH¯¯¯μbH(κbH−1). (17)
###### Remark 4.

In [13, 15, 16], the authors make use of a classical technique in [2] to derive a different uniform spectral bound of . Their technique involves applying and recursively to the BFGS update rule

 B(k)r=B(k−1)r−B(k−1)rskskTB(k−1)rskTB(k−1)rsk+ykyTksTkyk, (18)

where denotes the approximate Hessian matrix at step in the reconstruction of . The lower and upper bounds derived by this technique are

 ˜γ=1(d+M)Lmaxand˜Γ =(d+M)d+M−1κd+M−1maxμmin

respectively. As will be seen in Proposition 1, the overall computational complexity of Algorithm 1 heavily depends on the estimated (uniform) condition number of . Therefore, it is instructive to compute this quantity for both and as

 κH ≜Γγ=(M+1)κM+2bHκbH−1≈(M+1)κM+1bH, (19) ˜κH ≜˜Γ˜γ=(M+d)M+dκM+dmax, (20)

where the approximation in (19) follows from (see Footnote 4). By comparing (19) and (20), we notice our estimate for the condition number of , namely , is smaller than those in [15] and [16], namely , in several aspects. First, does not grow (exponentially) with the data dimension . Second, depends on , which is usually much smaller than . Third, even if we set in (20), the factor in (19) is much smaller than the factor in (20). As a result, our improved estimate of the condition number of will lead to a much better computational complexity estimate (see Proposition 1).

### Iv-C Main Results

Our main convergence results consist of Theorem 1 and Corollary 1, which provide linear convergence guarantees of to in expectation and almost surely, respectively.

###### Theorem 1.

In Algorithm 1, choose and sufficiently large. With either option I or II, we have

 E (21) ρ =bγ¯¯¯μmη(b−4ηΓ¯¯¯¯L)+4ηΓ¯¯¯¯Lb−4ηΓ¯¯¯¯L(1+1m)<1. (22)
###### Proof.

Fix an outer iteration and consider an inner iteration . Define . For brevity, we omit the dependence of on and . The iteration in line 10 of Algorithm 1 becomes

 ˜xs,t+1,r=˜xs,t,r−η˜vs,t,r. (23)

Define . From Lemmas 1 and 3,

 EBs,t[˜vs,t,r|Fs,t]=H1/2r∇f(xs,t)=∇˜fr(˜xs,t,r)and (24) EBs,t[∥˜δs,t,r∥2∣∣Fs,t]≤∥∥H1/2∥∥24¯¯¯¯Lb(f(xs,t)−f(x∗) +f(xs)−f(x∗)) ≤4Γ¯¯¯¯Lb(˜fr(˜xs,t,r)−˜fr(˜x∗r)+˜fr′(˜xs,r′)−˜fr′(˜x∗r′)), (25)

where . Using (23), we can express the distance between and as

 ∥˜xs,t+1,r−˜x∗r∥2=∥˜xs,t,r−˜x∗r∥2 +2η(η2∥˜vs,t,r∥2−⟨˜vs,t,r,˜xs,t,r−˜x∗r⟩). (26)

We can show

 η2∥˜vs,t,r∥2−⟨˜vs,t,r,˜xs,t,r−˜x∗r⟩≤−(˜fr(˜xs,t+1,r)−˜fr(˜x∗r)) −⟨˜δs,t,r,˜xs,t+1,r−˜x∗r⟩−γ¯¯¯μ2∥˜xs,t,r−˜x∗r∥2 (27)

from steps (28) to (32) on the next page.

In (30), we use the condition . In (32), we use the -smoothness of and the -strong convexity of in Lemma 2 respectively.
Now, substituting (27) into (26), we have

 ∥˜xs,t+1,r−˜x∗r∥2≤(1−ηγ¯¯¯μ)∥˜xs,t,r−˜x∗r∥2 −2η(˜fr(˜xs,t+1,r)−˜fr(˜x∗r))−2η⟨˜δs,t,r,˜xs,t+1,r−˜x∗r⟩ =(1−ηγ¯¯¯μ)∥˜xs,t,r−˜x∗r∥2−2η(˜fr(˜xs,t+1,r)−˜fr(˜x∗r)) +2η2∥˜δs,t,r∥2−2η⟨˜δs,t,r,˜xs,t,r−˜x∗r⟩. (33)

Taking expectation w.r.t.  and using (24) and (25), we have

 EBs,t[∥˜xs,t+1,r−˜x∗r∥2+2η(˜fr(˜xs,t+1,r)−˜fr(˜x∗r))∣∣Fs,t] ≤(1−ηγ¯¯¯μ)∥˜xs,t,r−˜x∗r∥2+8bΓ¯¯¯¯Lη2(˜fr(˜xs,t,r)−˜fr(˜x∗r) +˜fr′(˜xs,r′)−˜fr′(˜x∗r′)). (34)

By bounding the factor by , we can telescope (34) over and obtain

 EBs,(m−1