On the Acceleration of L-BFGS with Second-Order Information and Stochastic Batches

This paper proposes a framework of L-BFGS based on the (approximate) second-order information with stochastic batches, as a novel approach to the finite-sum minimization problems. Different from the classical L-BFGS where stochastic batches lead to instability, we use a smooth estimate for the evaluations of the gradient differences while achieving acceleration by well-scaling the initial Hessians. We provide theoretical analyses for both convex and nonconvex cases. In addition, we demonstrate that within the popular applications of least-square and cross-entropy losses, the algorithm admits a simple implementation in the distributed environment. Numerical experiments support the efficiency of our algorithms.

Authors

• 61 publications
• 41 publications
• 37 publications
• 83 publications
• Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization

Variance reduction techniques like SVRG provide simple and fast algorith...
05/01/2019 ∙ by Rong Ge, et al. ∙ 0

• Distributed stochastic inertial methods with delayed derivatives

Stochastic gradient methods (SGMs) are predominant approaches for solvin...
07/24/2021 ∙ by Yangyang Xu, et al. ∙ 0

• Cubic Regularization with Momentum for Nonconvex Optimization

Momentum is a popular technique to accelerate the convergence in practic...
10/09/2018 ∙ by Zhe Wang, et al. ∙ 4

• Curvature-Exploiting Acceleration of Elastic Net Computations

This paper introduces an efficient second-order method for solving the e...
01/24/2019 ∙ by Vien V. Mai, et al. ∙ 0

• SONC Optimization and Exact Nonnegativity Certificates via Second-Order Cone Programming

The second-order cone (SOC) is a class of simple convex cones and optimi...
12/14/2020 ∙ by Victor Magron, et al. ∙ 0

• Blackwell dominance in large samples

We study repeated independent Blackwell experiments; standard examples i...
06/06/2019 ∙ by Xiaosheng Mu, et al. ∙ 0

• SONIA: A Symmetric Blockwise Truncated Optimization Algorithm

This work presents a new algorithm for empirical risk minimization. The ...
06/06/2020 ∙ by Majid Jahani, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

We consider the finite-sum minimization problem of the form

 minw∈Rd{F(w):=1n∑i∈[n]f(w;xi,zi)def=1n∑i∈[n]fi(w)}, (1)

where , and are the data pairs. Throughout the paper, we assume there exists a global optimal solution of (1); in other words, we have a lower bound of (1).

In general, the problem of form (1

) covers a wide range of convex and nonconvex problems including logistic regression

[5], multi-kernel learning[2, 33], conditional random fields [14][10], etc. Classical first-order methods to solve (1) are gradient descent (GD) [26]

[29, 31]. A large class of optimization methods can be used to solve (1), where the iterative updates can be generalized as follows,

 wk+1=wk+αkpk,with pk=−Hkgk, (2)

where is some descent direction, is an inverse Hessian approximation of at , and is an estimate of .

When

is an identity matrix, the update is considered a first-order method. Numerous work has focused on the choice of

such as SAG/SAGA [30, 7], MISO/FINITO [17, 8], SDCA [32], SVRG/S2GD [11, 13], SARAH [23, 24]. Nevertheless, with the importance of second-order optimization providing potential curvature around local optima and thus promoting fast convergence, the choice of non-identity is crucial to the development of modern optimization algorithms.

Within the framework of second-order optimization, a popular choice for is the inverse Hessian; however, we lack an efficient way to invert matrices, leading to increases in computation and communication costs to a problem for the distributed setting. Motivated by this, quasi-Newton methods, among which BFGS is one of the most popular, were developed, including a practical variant named limited-memory BFGS (L-BFGS) [26]. It has been widely known that batch methods have been successfully applied in first-order algorithms and provide effective improvements, but it remains a problem for L-BFGS due to the instability caused by randomness between different gradient evaluations. Therefore, the development of an efficient and stable L-BFGS is necessary.

Our contributions. In this paper, we analyze L-BFGS with stochastic batches for both convex and nonconvex optimization, as well as its distributed implementation. LBFGS-H originates from the idea of L-BFGS and uses Hessian information to approximate the differences of gradients. LBFGS-F combines L-BFGS with Fisher information matrix from the natural gradient algorithm [1, 19, 27]. We show that they are efficient for minimizing finite-sum problems both in theory and in practice. The key contributions of our paper are summarized as follows.

• [noitemsep,nolistsep]

• We propose a framework for approximating the differences of gradients in the L-BFGS algorithm that ensures stability for the general finite-sum problems. We show it converges linearly to a neighborhood of the optimal solution for convex and nonconvex finite-sum problems under standard assumptions [3].

• In a distributed environment, we introduce a variant LBFGS-F where the Hessian matrix for approximating gradient differences is replaced by the Fisher information matrix [19].

• With a potential acceleration in practice using ADAM techniques [12], we verify the competitive performances of both LBFGS-H and LBFGS-F against mainstream optimization methods in both convex and nonconvex applications.

2 The Algorithm

In this section, we propose a new stochastic L-BFGS framework, as well as its distributed implementation with Fisher information matrix. Before proceeding to the new algorithm, let us revisit the procedure for the classical L-BFGS.

2.1 Limited-memory BFGS

The classical L-BFGS algorithm [26] is presented as below.

In each iteration, first, we estimate the direction by using curvature pairs . Then, the learning rate is chosen such that certain condition (e.g. line search) is satisfied, and we make an update. Last, we evaluate the curvature pairs and replace the pairs stored in the memory while keeping the number of curvature pairs no larger than . The key step in this procedure is the evaluation of the search direction using the curvature pairs, which appears as the well-known two-loop recursion (Algorithm 0(b)). Note that in the classical L-BFGS, the main algorithm usually applies a line-search technique for choosing the learning rate .

The intrinsic idea within L-BFGS is to utilize the curvature information implied by the vector pairs to help regularize the gradient direction. However, within the setting of stochastic batches, the update of , where the batch gradient is defined as

 ∇FSk:=1|Sk|∑i∈Sk∇fi(wk),

makes it difficult to stabilize the behavior of the algorithm. One of the remedies is to assume that there is an overlap between the samples and , i.e., , and replace the with in  [3]. However, this idea requires the batch size to be large enough.

Recall from the Taylor expansion for a multivariate vector-valued function

 g(wk+1)=g(wk)+Jg(wk)(xk+1−xk)+o(∥wk+1−wk∥2)1d,

where is the Jacobian matrix with respect to , and has all elements to be . Hence, we can conclude that: when is close to ,

 yk =∇F(wk+1)−∇F(wk)=Bk(wk+1−wk)+o(∥wk+1−wk∥2)1d≈Bk(wk+1−wk),

where is the Hessian at , which is exactly the secant equation in BFGS. Therefore, another possible remedy to stabilize L-BFGS is to approximate the differences of gradient using (approximate) second-order information, i.e.,

 yk=Bk(wk+1−wk),

as this allows smooth and stable evaluation of . Meanwhile, the Hessian-vector product can be easily computed and is not expensive [18, 20].

2.2 Stochastic L-BFGS with Hessian Information and Vector-free Two-loop Recursion

The proposed algorithm of stochastic L-BFGS with Hessian information (LBFGS-H) is formulated by replacing with the stochastic version of in Algorithm 0(a), i.e.,

 yk=BSkk(wk+1−wk), (3)

where is the stochastic batch picked at iteration and .

For an efficient implementation in a map-reduce environment (e.g. Hadoop, Spark), we use a vector-free L-BFGS (VL-BFGS) update in Algorithm 2 originated from [4] for the two-loop recursion. [4] proposes a vector-free L-BFGS based on the classical L-BFGS where they set . However, the choice of is very important, therefore we propose the vector-free L-BFGS algorithm applicable to any feasible as follows.

In details, if we observe the direction generated by the two-loop recursion in Algorithm 0(b), we are able to figure out that we can represent the output direction using the invariable base vectors, i.e.,

 b1=sk−m+1,…,bm=sk,bm+1=yk−m+1,…,b2m=yk,b2m+1=gk. (4)

The direction after the first loop can be written as , and after we scale the direction with we obtain , so the final result of the two-loop recursion can be written as

 −Hk∇FSk(wk)=δ0r0+∑ml=1δlbl.

Note that the coefficients are evaluated with only dot-products which are defined in the matrix of the following form:

 Mpq={yTk−m+psk−m+q,if p,q∈{1,2,…,m},gTksk−m+q,if p=m+1,q∈{1,2,…,m}. (5)

Let us denote as in Algorithm 2. In the first loop, the evaluations of are the same as [4], where is a linear combination of with the same corresponding coefficients , and from to ,

 αi=ρisTiq=ρisTi(∑2ml=m+1δlyk+l−2m+δ2m+1gk)=1M(j,j)∑m+1l=1δl+mM(l,j).

However, in the second loop which contributes to the coefficient evaluations of , from to ,

 β =ρiyTir=ρiyTi(δ0r+∑ml=1δlbl)=1M(j,j)(δ0Yj+∑ml=1δlM(j,l)),

when we define a vector with the elements .

Therefore, we can conclude that Algorithm 2 is mathematically equivalent to Algorithm 0(b). It is trivial to verify that with , Algorithm 2 recovers the vector-free L-BFGS in [4].

2.3 Fisher Information Matrix as a Hessian Approximation and Distributed Optimization

When we have no access to the second-order information, instead of utilizing , we are still able to use approximations of . Recently, numerous research has been conducted on the natural gradient algorithm, where in the update (2), the inverse Fisher information matrix serves as  [1, 20].

If we further consider the cost function in (1) as , where , is a convex loss and is some network structure, then an element of the Hessian matrix can be written as:

 Bij=∑dk=0∑dl=0∂L2(h(w))∂hl(w)∂hk(w)∂hl(w)∂wj∂hk(w)∂wi+∑dk=0∂L(h(w))∂hk(w)∂h2k(w)∂wj∂wi, (6)

where the first term is the component of the Hessian due to variation in ; since we are only looking at variation in , we do effectively a change of basis using the Jacobian of . The second term, on the other hand, is the component that is due to variation in , which is why we see the Hessian of . As it goes to the neighborhood of the minimum of the cost , the first derivatives are approaching zero, which indicates that the second term is negligible. However, the first term, as an approximation of Hessian, which can be written as the following in the matrix form, is no different but identical to the Generalized Gauss-Newton matrix (GGN) [19],

 Bk≈[J(wk)]TL(k)hhJ(wk)def% =Bk, (7)

where is the Jacobian matrix of with respect to at , is the Hessian matrix of with respect to at and we use to denote the Hessian or Hessian approximation used to smoothen , i.e., .

It has been verified that in the cases of popular loss functions such as cross-entropy and least-squares, the GGN matrix is exactly the Fisher information matrix (FIM)

[19]. Note that here can be nonconvex which covers the applications of neural networks. Under the framework of stochastic L-BFGS we propose, we introduce the stochastic L-BFGS with Fisher information (LBFGS-F) by replacing in (3) with the FIM. Note that when the predictor is linear, i.e. , with the loss function as either the cross-entropy or the least-squares, LBFGS-F is identical to LBFGS-H (GGN = FIM = Hessian). Similarly, this also applies to the batch version of the FIM.

As a map-reduce implementation of L-BFGS, the VL-BFGS update is praised for the parallelizable and distributed updates, and the possible communication cost in a distributed environment is in each iteration [4], where is a small constant among the choices of . The classical L-BFGS needs an update on the gradient to calculate and this can be implemented by calculating local gradients from different workers and then the local gradients being aggregated on the server. We can still apply similar tricks to LBFGS-F but it requires more strict assumptions.

Assumption 1.

[Diagonal Hessian of ] The Hessian of the loss function with respect to the , is always diagonal.

Remark 1.

This condition is not always true throughout all applications; however, in the cases of least-square loss and cross-entropy loss, where the prior case has with as the stochastic batch, and in the later case , the Hessian is obviously always diagonal.

Consider a specific batch . If we split it into blocks, where the blocks are denoted as , and assume the corresponding Jacobian block matrices as , then the Hessian vector product with any vector can be written as

 BSkk=[Jh]TLSkhhJh=[JSk1h⋯,JSkτh]TLSkhh[JSk1h⋯,JSkτh]

and since is diagonal, we can write in the form of diagonal blocks with the sizes to be , thus the above is equivalent to

 BSkkv=∑τi=1[JSkih]TLSkihhJSkihv=∑τi=1BSkikv,∀ v,

which means that we can evaluate the FIM-vector products with data distributed on different workers and then aggregate them on the server. The communication cost in each round can be .

Theorem 1.

[Distributed Optimization and Communication Cost] Suppose that Assumption 1 holds. Then Algorithm LBFGS-F can be implemented in a distributed fashion, with a total communication cost of in each round, where is the number of workers.

2.4 Implementation Details

In this part, we cover important techniques for our stochastic LBFGS framework. The initialization and momentum are crucial in accelerating the algorithm. Meanwhile, keeping positive semi-definiteness is significant for finding the correct direction , especially in the nonconvex setting.

Initialization and Momentum

The initialization is crucial in the L-BFGS algorithm. The original L-BFGS proposes to use as where is a constant and a commonly great choice suggested is . However, this may not be the case in the stochastic setting where stochasticity can lead to considerable fluctuations in Hessian scalings over the iterations. Therefore, we consider to use a momentum technique where we combine the past first-order information with the current one. With the recent success of ADAM [12], the scaling of the ADAM stochastic gradient provides excellent and stable performance. The authors evaluate the momentum stochastic gradient: with

and the momentum of the second moment of stochastic gradient

followed by a bias correction step, i.e., and , where is an approximation to the diagonal of the Fisher information matrix [27]. Then ADAM makes a step with a direction .

Hence, in our experiments, we estimate with the ADAM preconditioner, i.e., , and apply momentum to update the stochastic gradient with . Note that when the memory in Algorithm 2, our algorithm completely recovers ADAM.

Guarantees of Positive Semi-definiteness

The standard BFGS updates can fail in handling non-convexity because of difficulty in approximating Hessian with a positive definite matrix [6, 21]

. Even L-BFGS with limited updates over each iteration, cannot guarantee the eigenvalues of approximate Hessian bounded above and away from zero. One has to apply a cautious update where the curvature condition

is satisfied in order to maintain the positive definiteness of Hessian approximations [26]. As a well-suited approach to our algorithm, we employ a cautious strategy [16]: we skip the update, i.e., set , if

 yTksk≥ϵ∥sk∥2 (8)

is violated, where is a predefined positive constant. With the stated condition guaranteed at each L-BFGS update, the eigenvalues of the Hessian approximations generated by our framework are bounded above and away from zero (Lemma 2).

3 Convergence Analysis

In this section, we study the convergence of our stochastic L-BFGS framework. Due to the stochastic batches of the LBFGS-F and LBFGS-H, by using a fixed learning rate, one cannot establish the convergence to the optimal solution (or first-order stationary point) but only to a neighborhood of it. We provide theoretical foundations for both strongly convex and nonconvex objectives. Throughout the analysis, we will assume that , the function is -Lipschitz continuous or -smooth. i.e.,

 ∥∇fi(w′)−∇fi(w)∥≤Λ∥w′−x∥,∀w,w′∈Rd, (9)

or equivalently,

 fi(w′)≤fi(w)+∇fi(w)T(w′−w)+Λ2∥w′−w∥2,∀w,w′∈Rd. (10)

The above implies that is also -smooth.

3.1 Strongly Convex Case

Now we are ready to present the theoretical results for strongly convex objectives. Under this circumstance, the global optimal points is unique. Before proceeding, we need to make the following standard assumptions [3] about the objective and the algorithm.

Assumption 2.

Assume that the following assumptions hold.

• [noitemsep]

• is twice continuously differentiable.

• There exist positive constants and such that for all and all batches of size , where refers to the Hessian (approximation) to stabilize .

• in Algorithm 2 is symmetric and there exists such that .

• The batches are drawn independently and

is an unbiased estimator of the true gradient

for all , i.e., .

We should be aware that Assumption 2B also suggests that there is some such that , i.e., F is strongly convex with and -smooth. Because of -strong convexity, satisfies:

 F(w′)≥F(w)+∇F(w)T(w′−w)+λ2∥w′−w∥2,∀w′,w∈Rd. (11)

In addition, we should remark here that Assumption 2 is different to the standard assumption in [3] in the sense that we do not require a bounded stochastic gradient assumption since such assumption is barely correct in both theory and practice [25]. We also remark that Assumption 2 is generalization to the corresponding assumption in [3] where by setting , we recover the assumption in [3] so Assumption 2C is not a new assumption. Under the above assumptions, we are able to declare the following lemma that the Hessian approximation formulated by Algorithm 2 are bounded above and away from zero.

Lemma 1.

If Assumptions 2A-C hold, then there exist constants such that generated by Algorithm 2 in the stochastic form satisfy:

 μ1I⪯Hk⪯μ2I, for k=0,1,2,…

With the help of Lemma 1, but different from [3], we can prove the following theorem without bounded assumption for the stochastic gradient.

Theorem 2.

Suppose that Assumptions 2A-D hold, s are convex, and let where is the minimizer of . Let be the iterates generated by the stochastic L-BFGS framework with a constant learning rate and with generated by Algorithm 2. Then for all ,

 E[F(wk)− F∗]≤{1−[1−2α(λμ1−αμ22(λ+Λβ(b))Λ)]k}αμ22ΛN4(λμ1−αμ22(λ+Λβ(b))Λ) +[1−2α(λμ1−αμ22(λ+Λβ(b))Λ)]k[F(w0)−F∗]k→∞−−−→αμ22ΛN4(λμ1−αμ22(λ+Λβ(b))Λ),

where , and . (Check a complete version in Appendix A.)

3.2 Nonconvex Case

Under the following standard nonconvex assumptions [3], we can proceed with the convergence for nonconvex problems for the first-order stationary points.

Assumption 3.

Assume that the following assumptions hold.

• [noitemsep]

• is twice continuously differentiable.

• There exists a positive constant such that for all batches of size . is -smooth.

• in Algorithm 2 is symmetric and there exists such that .

• The function is bounded below by a scalar .

• There exist constants and such that for all and batches of size .

• The batches are drawn independently and is an unbiased estimator of the true gradient for all , i.e., .

Similar as the strongly convex case, by setting , Assumption 3B is equivalent to saying that is -smooth or is -Lipschitz continuous which recovers the corresponding assumption in [3]. However, different from the strongly convex case, here we need the bounded gradient assumption (Assumption 3E). Again, with the help of the above assumptions, we can conclude that bounded above and away from zero as follows.

Lemma 2.

If Assumptions 3A-C hold, then there exist constants such that generated by Algorithm 2 (we use a skipping scheme in Section 2.4, i.e., we skip the update by setting when (8) is violated) in the stochastic form satisfy:

 μ1I⪯Hk⪯μ2I, for k=0,1,2,…

With Lemma 2, the convergence to a neighborhood can also be proven for nonconvex cases.

Theorem 3.

Suppose that Assumptions 3A-F hold. Let be the iterates generated by the stochastic L-BFGS framework with a constant learning rate and starting from by setting whenever (8) is violated. Then for all ,

 E[1L∑L−1k=0∥∇F(wk)∥2] ≤αμ22γ2Λμ1+2[F(w0)−F∗]αμ1LL→∞−−−→αμ22γ2Λμ1.

4 Numerical Experiments

In this section, we present numerical results to illustrate the properties and performance of our proposed algorithms (LBFGS-H and LBFGS-F) on both convex and nonconvex applications. For comparison, we show performance of popular stochastic gradient algorithms, namely, ADAM [12], ADAGRAD [9] and SGD (momentum SGD). Besides, we include the performance for classical L-BFGS where , and a stochastic L-BFGS as LBFGS-S where we set In the convex setting, we test logistic regression problem on ijcnn1 111Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

. where LBFGS-H is identical to LBFGS-F because of the linear predictor, so we omit the results for LBFGS-F. On the other hand, we show performance of 1-hidden layer neural network (with 300 neurons) and LeNet-5 (a classical convolutional neural network

[15] on MNIST

. Across all the figures, each epoch refers to a full pass of the dataset, i.e.,

Figure 1 shows sub-optimality (training loss for the last column) and test errors of various methods with batch sizes and on the logistic regression problem with ijcnn1 for the first two columns, and LBFGS-H exhibits competitive performance with ADAM, SGD and ADAGRAD while LBFGS-S seems highly unstable. On the nonconvex examples for the last two columns in the figure, similar results are presented with LBFGS-S to be extremely unstable and slow.

To further show the robustness of LBFGS-H (LBFGS-F), we run each method with different batch sizes and 100 different random seeds on the logistic regression problem with dataset ijcnn1 in Figure 2, and report the results. The dotted lines represent the best and worst performance of the corresponding algorithm and the solid line shows the average performance. Obviously, with large batch sizes, the performance of ADAM, ADAGRAD and SGD worsen while LBFGS-H behaves steadily fast and outperforms the others in sub-optimality. This also conveys that to achieve the same accuracy, fewer epochs are needed, leading to fewer communications for our framework when the batch size is large.

The ability to use a large batch size is of particular interest in a distributed environment since it allows us to scale to multiple GPUs without reducing the per-GPU workload and without sacrificing model accuracy. In order to illustrate the benefit of large batch sizes, we evaluate the stochastic gradient on a neural network with different batch sizes () on a single GPU (Tesla K80), and compare the computational time against that of the pessimistic and utopian cases in Figure 3. Up to , the computational time stays almost constant; nevertheless, with a sufficiently large batch size (), the problem becomes computationally bounded and suffers from the computing resource limited by the single GPU, hence doubling batch size leads to doubling computational time. Therefore, the efficiency of our proposed algorithm shown in Algorithm 2 can benefit tremendously from a distributed environment.

5 Conclusion

We developed a novel framework for the L-BFGS method with stochastic batches that is stable and efficient. Based on the framework, we proposed two variants – LBFGS-H and LBFGS-F, where the latter tries to employ Fisher information matrix instead of the Hessian to approximate the difference of gradients. LBFGS-F also admits a distributed implementation. We show that our framework converges linearly to a neighborhood of the optimal solution for convex and nonconvex settings under standard assumptions, and provide numerical experiments on both convex applications and nonconvex neural networks.

Acknowledgments

Jie Liu was partially supported by the IBM PhD Fellowship. Martin Takáč was partially supported by the U.S. National Science Foundation, under award number NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796. We would like to thank Courtney Paquette for her valuable advice on the paper.

References

• [1] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
• [2] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality and the smo algorithm. In ICML, 2004.
• [3] Albert S. Berahas, Jorge Nocedal, and Martin Takáč.

A multi-batch L-BFGS method for machine learning.

In NIPS, pages 1055–1063, 2016.
• [4] W. Chen, Z. Wang, and J. Zhou. Large-scale L-BFGS using mapreduce. In NIPS, pages 1332–1340, 2014.
• [5] David Roxbee Cox.

The regression analysis of binary sequences.

Journal of the Royal Statistical Society, 20(2):215–242, 1958.
• [6] Yu-Hong Dai. Convergence properties of the BFGS algoritm. SIAM Journal on Optimization, 13(3):693–701, 2002.
• [7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
• [8] Aaron Defazio, Justin Domke, and Tibério Caetano. A faster, permutable incremental gradient method for big data problems. In ICML, pages 1125–1133, 2014.
• [9] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
• [10] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.
• [11] Rie Johnson and Tong Zhang.

Accelerating stochastic gradient descent using predictive variance reduction.

In NIPS, pages 315–323, 2013.
• [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
• [13] Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10:242–255, 2016.
• [14] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
• [15] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
• [16] Dong-Hui Li and Masao Fukushima. A modified BFGS method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics, 129(1-2):15–35, 2001.
• [17] Julien Mairal. Optimization with first-order surrogate functions. In ICML, pages 783–791, 2013.
• [18] James Martens. Deep learning via hessian-free optimization. In ICML, pages 735–742, 2010.
• [19] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
• [20] James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
• [21] Walter F Mascarenhas. The BFGS method with exact line searches fails for non-convex objective functions. Mathematical Programming, 99(1):49–61, 2004.
• [22] A. Meenakshi and C. Rajian. On a product of positive semidefinite matrices. Linear Algebra and its Applications, 295(1):3–6, 1999.
• [23] Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pages 2613–2621, 2017.
• [24] Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261, 2017.
• [25] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, 2018.
• [26] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
• [27] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. ICLR, 2014.
• [28] Michael JD Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear programming, 9(1):53–72, 1976.
• [29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• [30] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2016.
• [31] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
• [32] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
• [33] Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

Appendix A Assumptions, Lemmas and Theorems

See 1 See 2 See 3

See 1 See 2

Lemma 3 (Lemma 4 in [13]).

Let be vectors in and . Let be a random subset of of size , chosen uniformly at random from all subsets of this cardinality. Taking expectation with respect to , we have

 E[∥∥1τ∑i∈^Sξi−¯ξ∥∥2]≤1nτn−τ(n−1)∑ni=1∥ξi∥2. (12)
Lemma 4 (Lemma 3 in [25] and Equation (10) in [11]).

If s are convex and -smooth , then ,

 E[∥∇fi(w)−∇fi(w∗)∥2]≤2Λ[F(w)−F(w∗)], (13)

where .

Lemma 5.

If is strongly convex with and s are -smooth , then , the batch gradient has the following bound,

 E[∥∇FS(w)∥2]≤4β(b)Λκ[F(w)−F(w∗)]+2∥∇F(w)∥2+N, (14)

where and . If we further have s convex, the bound shrinks to

 E[∥∇FS(w)∥2]≤4β(b)Λ[F(w)−F(w∗)]+2∥∇F(w)∥2+N. (15)

See 1

See 3

Theorem 4 (A complete version of Theorem 2).

Suppose that Assumptions 2A-D hold, and let where is the minimizer of . Let be the iterates generated by the stochastic L-BFGS framework with a constant learning rate and with generated by Algorithm 2. Then for all ,

 E[F(wk)−F∗] ≤[1−2α(λμ1−αμ22(λ+Λβ(b))Λ)]k[F(w0)−F∗] +{1−[1−2α(λμ1−αμ22(λ+Λβ(b)κ)Λ)]k}αμ22ΛN4(λμ1−αμ22(λ+Λβ(b)κ)Λ) ⟶αμ22ΛN4(λμ1−αμ22(λ+Λβ(b)κ)Λ) as k→∞,

where and . If we further have s convex, then similarly, with a learning rate , we have

 E[F(wk)−F∗] ≤[1−2α(λμ1−αμ22(λ+Λ)Λ)]k[F(w0)−F∗] +{1−[1−2α(λμ1−αμ22(λ+Λβ(b))Λ)]k}αμ22ΛN4(λμ1−αμ22(λ+Λβ(b))Λ) ⟶αμ22ΛN4(λμ1−αμ22(λ+Λβ(b))Λ) as k→∞.
Theorem 5.

Suppose that Assumptions 2A-D hold, s are convex and let where is the minimizer of . Let be the iterates generated by the stochastic L-BFGS framework with , where and satisfies

 E>2μ22μ21(κ+βκ2),2αλμ1−1−2α2μ22Λ(λ+Λβ(b))E>0.

Then starting from , for all ,

 E[F(wk)−F∗]≤G(α,E)k+E,

where

 G(α.E)def=max⎧⎪ ⎪⎨⎪ ⎪⎩α2μ22ΛN4αλμ1−2−4α2μ22Λ(λ+Λβ(b))E],EE[F(w0)−F∗]⎫⎪ ⎪⎬⎪ ⎪⎭. (16)

Appendix B Proofs

b.1 Proof of Lemma 1

Proof.

Instead of analyzing the algorithm in , we study the Hessian approximation where . In this case, the L-BFGS are updated as follows (note that the superscript of denotes the iteration of Hessian updates in each iteration).

• Set such that

 Σ−1I⪯H(0)k⪯σ−1I. (17)
• For set and compute

 H(i+1)k=H(i)k−H(i)ksjsTjH(i)ksTjH(i)ksj+yjyTjyTjsj.
• Set

Note that following the above updates, the curvature pairs are

 yk=Bksk,sk=wk+1−wk.

It is also easy to know that for LBFGS-H,