1 Introduction
We consider the finitesum minimization problem of the form
(1) 
where , and are the data pairs. Throughout the paper, we assume there exists a global optimal solution of (1); in other words, we have a lower bound of (1).
In general, the problem of form (1
) covers a wide range of convex and nonconvex problems including logistic regression
[5], multikernel learning[2, 33], conditional random fields [14][10], etc. Classical firstorder methods to solve (1) are gradient descent (GD) [26]and stochastic gradient descent (SGD)
[29, 31]. A large class of optimization methods can be used to solve (1), where the iterative updates can be generalized as follows,(2) 
where is some descent direction, is an inverse Hessian approximation of at , and is an estimate of .
When
is an identity matrix, the update is considered a firstorder method. Numerous work has focused on the choice of
such as SAG/SAGA [30, 7], MISO/FINITO [17, 8], SDCA [32], SVRG/S2GD [11, 13], SARAH [23, 24]. Nevertheless, with the importance of secondorder optimization providing potential curvature around local optima and thus promoting fast convergence, the choice of nonidentity is crucial to the development of modern optimization algorithms.Within the framework of secondorder optimization, a popular choice for is the inverse Hessian; however, we lack an efficient way to invert matrices, leading to increases in computation and communication costs to a problem for the distributed setting. Motivated by this, quasiNewton methods, among which BFGS is one of the most popular, were developed, including a practical variant named limitedmemory BFGS (LBFGS) [26]. It has been widely known that batch methods have been successfully applied in firstorder algorithms and provide effective improvements, but it remains a problem for LBFGS due to the instability caused by randomness between different gradient evaluations. Therefore, the development of an efficient and stable LBFGS is necessary.
Our contributions. In this paper, we analyze LBFGS with stochastic batches for both convex and nonconvex optimization, as well as its distributed implementation. LBFGSH originates from the idea of LBFGS and uses Hessian information to approximate the differences of gradients. LBFGSF combines LBFGS with Fisher information matrix from the natural gradient algorithm [1, 19, 27]. We show that they are efficient for minimizing finitesum problems both in theory and in practice. The key contributions of our paper are summarized as follows.

[noitemsep,nolistsep]

We propose a framework for approximating the differences of gradients in the LBFGS algorithm that ensures stability for the general finitesum problems. We show it converges linearly to a neighborhood of the optimal solution for convex and nonconvex finitesum problems under standard assumptions [3].

In a distributed environment, we introduce a variant LBFGSF where the Hessian matrix for approximating gradient differences is replaced by the Fisher information matrix [19].

With a potential acceleration in practice using ADAM techniques [12], we verify the competitive performances of both LBFGSH and LBFGSF against mainstream optimization methods in both convex and nonconvex applications.
2 The Algorithm
In this section, we propose a new stochastic LBFGS framework, as well as its distributed implementation with Fisher information matrix. Before proceeding to the new algorithm, let us revisit the procedure for the classical LBFGS.
2.1 Limitedmemory BFGS
The classical LBFGS algorithm [26] is presented as below.
In each iteration, first, we estimate the direction by using curvature pairs . Then, the learning rate is chosen such that certain condition (e.g. line search) is satisfied, and we make an update. Last, we evaluate the curvature pairs and replace the pairs stored in the memory while keeping the number of curvature pairs no larger than . The key step in this procedure is the evaluation of the search direction using the curvature pairs, which appears as the wellknown twoloop recursion (Algorithm 0(b)). Note that in the classical LBFGS, the main algorithm usually applies a linesearch technique for choosing the learning rate .
The intrinsic idea within LBFGS is to utilize the curvature information implied by the vector pairs to help regularize the gradient direction. However, within the setting of stochastic batches, the update of , where the batch gradient is defined as
makes it difficult to stabilize the behavior of the algorithm. One of the remedies is to assume that there is an overlap between the samples and , i.e., , and replace the with in [3]. However, this idea requires the batch size to be large enough.
Recall from the Taylor expansion for a multivariate vectorvalued function
where is the Jacobian matrix with respect to , and has all elements to be . Hence, we can conclude that: when is close to ,
where is the Hessian at , which is exactly the secant equation in BFGS. Therefore, another possible remedy to stabilize LBFGS is to approximate the differences of gradient using (approximate) secondorder information, i.e.,
as this allows smooth and stable evaluation of . Meanwhile, the Hessianvector product can be easily computed and is not expensive [18, 20].
2.2 Stochastic LBFGS with Hessian Information and Vectorfree Twoloop Recursion
The proposed algorithm of stochastic LBFGS with Hessian information (LBFGSH) is formulated by replacing with the stochastic version of in Algorithm 0(a), i.e.,
(3) 
where is the stochastic batch picked at iteration and .
For an efficient implementation in a mapreduce environment (e.g. Hadoop, Spark), we use a vectorfree LBFGS (VLBFGS) update in Algorithm 2 originated from [4] for the twoloop recursion. [4] proposes a vectorfree LBFGS based on the classical LBFGS where they set . However, the choice of is very important, therefore we propose the vectorfree LBFGS algorithm applicable to any feasible as follows.
In details, if we observe the direction generated by the twoloop recursion in Algorithm 0(b), we are able to figure out that we can represent the output direction using the invariable base vectors, i.e.,
(4) 
The direction after the first loop can be written as , and after we scale the direction with we obtain , so the final result of the twoloop recursion can be written as
Note that the coefficients are evaluated with only dotproducts which are defined in the matrix of the following form:
(5) 
Let us denote as in Algorithm 2. In the first loop, the evaluations of are the same as [4], where is a linear combination of with the same corresponding coefficients , and from to ,
However, in the second loop which contributes to the coefficient evaluations of , from to ,
when we define a vector with the elements .
2.3 Fisher Information Matrix as a Hessian Approximation and Distributed Optimization
When we have no access to the secondorder information, instead of utilizing , we are still able to use approximations of . Recently, numerous research has been conducted on the natural gradient algorithm, where in the update (2), the inverse Fisher information matrix serves as [1, 20].
If we further consider the cost function in (1) as , where , is a convex loss and is some network structure, then an element of the Hessian matrix can be written as:
(6) 
where the first term is the component of the Hessian due to variation in ; since we are only looking at variation in , we do effectively a change of basis using the Jacobian of . The second term, on the other hand, is the component that is due to variation in , which is why we see the Hessian of . As it goes to the neighborhood of the minimum of the cost , the first derivatives are approaching zero, which indicates that the second term is negligible. However, the first term, as an approximation of Hessian, which can be written as the following in the matrix form, is no different but identical to the Generalized GaussNewton matrix (GGN) [19],
(7) 
where is the Jacobian matrix of with respect to at , is the Hessian matrix of with respect to at and we use to denote the Hessian or Hessian approximation used to smoothen , i.e., .
It has been verified that in the cases of popular loss functions such as crossentropy and leastsquares, the GGN matrix is exactly the Fisher information matrix (FIM)
[19]. Note that here can be nonconvex which covers the applications of neural networks. Under the framework of stochastic LBFGS we propose, we introduce the stochastic LBFGS with Fisher information (LBFGSF) by replacing in (3) with the FIM. Note that when the predictor is linear, i.e. , with the loss function as either the crossentropy or the leastsquares, LBFGSF is identical to LBFGSH (GGN = FIM = Hessian). Similarly, this also applies to the batch version of the FIM.As a mapreduce implementation of LBFGS, the VLBFGS update is praised for the parallelizable and distributed updates, and the possible communication cost in a distributed environment is in each iteration [4], where is a small constant among the choices of . The classical LBFGS needs an update on the gradient to calculate and this can be implemented by calculating local gradients from different workers and then the local gradients being aggregated on the server. We can still apply similar tricks to LBFGSF but it requires more strict assumptions.
Assumption 1.
[Diagonal Hessian of ] The Hessian of the loss function with respect to the – , is always diagonal.
Remark 1.
This condition is not always true throughout all applications; however, in the cases of leastsquare loss and crossentropy loss, where the prior case has with as the stochastic batch, and in the later case , the Hessian is obviously always diagonal.
Consider a specific batch . If we split it into blocks, where the blocks are denoted as , and assume the corresponding Jacobian block matrices as , then the Hessian vector product with any vector can be written as
and since is diagonal, we can write in the form of diagonal blocks with the sizes to be , thus the above is equivalent to
which means that we can evaluate the FIMvector products with data distributed on different workers and then aggregate them on the server. The communication cost in each round can be .
Theorem 1.
[Distributed Optimization and Communication Cost] Suppose that Assumption 1 holds. Then Algorithm LBFGSF can be implemented in a distributed fashion, with a total communication cost of in each round, where is the number of workers.
2.4 Implementation Details
In this part, we cover important techniques for our stochastic LBFGS framework. The initialization and momentum are crucial in accelerating the algorithm. Meanwhile, keeping positive semidefiniteness is significant for finding the correct direction , especially in the nonconvex setting.
Initialization and Momentum
The initialization is crucial in the LBFGS algorithm. The original LBFGS proposes to use as where is a constant and a commonly great choice suggested is . However, this may not be the case in the stochastic setting where stochasticity can lead to considerable fluctuations in Hessian scalings over the iterations. Therefore, we consider to use a momentum technique where we combine the past firstorder information with the current one. With the recent success of ADAM [12], the scaling of the ADAM stochastic gradient provides excellent and stable performance. The authors evaluate the momentum stochastic gradient: with
and the momentum of the second moment of stochastic gradient
followed by a bias correction step, i.e., and , where is an approximation to the diagonal of the Fisher information matrix [27]. Then ADAM makes a step with a direction .Hence, in our experiments, we estimate with the ADAM preconditioner, i.e., , and apply momentum to update the stochastic gradient with . Note that when the memory in Algorithm 2, our algorithm completely recovers ADAM.
Guarantees of Positive Semidefiniteness
The standard BFGS updates can fail in handling nonconvexity because of difficulty in approximating Hessian with a positive definite matrix [6, 21]
. Even LBFGS with limited updates over each iteration, cannot guarantee the eigenvalues of approximate Hessian bounded above and away from zero. One has to apply a cautious update where the curvature condition
is satisfied in order to maintain the positive definiteness of Hessian approximations [26]. As a wellsuited approach to our algorithm, we employ a cautious strategy [16]: we skip the update, i.e., set , if(8) 
is violated, where is a predefined positive constant. With the stated condition guaranteed at each LBFGS update, the eigenvalues of the Hessian approximations generated by our framework are bounded above and away from zero (Lemma 2).
3 Convergence Analysis
In this section, we study the convergence of our stochastic LBFGS framework. Due to the stochastic batches of the LBFGSF and LBFGSH, by using a fixed learning rate, one cannot establish the convergence to the optimal solution (or firstorder stationary point) but only to a neighborhood of it. We provide theoretical foundations for both strongly convex and nonconvex objectives. Throughout the analysis, we will assume that , the function is Lipschitz continuous or smooth. i.e.,
(9) 
or equivalently,
(10) 
The above implies that is also smooth.
3.1 Strongly Convex Case
Now we are ready to present the theoretical results for strongly convex objectives. Under this circumstance, the global optimal points is unique. Before proceeding, we need to make the following standard assumptions [3] about the objective and the algorithm.
Assumption 2.
Assume that the following assumptions hold.

[noitemsep]

is twice continuously differentiable.

There exist positive constants and such that for all and all batches of size , where refers to the Hessian (approximation) to stabilize .

in Algorithm 2 is symmetric and there exists such that .

The batches are drawn independently and
is an unbiased estimator of the true gradient
for all , i.e., .
We should be aware that Assumption 2B also suggests that there is some such that , i.e., F is strongly convex with and smooth. Because of strong convexity, satisfies:
(11) 
In addition, we should remark here that Assumption 2 is different to the standard assumption in [3] in the sense that we do not require a bounded stochastic gradient assumption since such assumption is barely correct in both theory and practice [25]. We also remark that Assumption 2 is generalization to the corresponding assumption in [3] where by setting , we recover the assumption in [3] so Assumption 2C is not a new assumption. Under the above assumptions, we are able to declare the following lemma that the Hessian approximation formulated by Algorithm 2 are bounded above and away from zero.
Lemma 1.
3.2 Nonconvex Case
Under the following standard nonconvex assumptions [3], we can proceed with the convergence for nonconvex problems for the firstorder stationary points.
Assumption 3.
Assume that the following assumptions hold.

[noitemsep]

is twice continuously differentiable.

There exists a positive constant such that for all batches of size . is smooth.

in Algorithm 2 is symmetric and there exists such that .

The function is bounded below by a scalar .

There exist constants and such that for all and batches of size .

The batches are drawn independently and is an unbiased estimator of the true gradient for all , i.e., .
Similar as the strongly convex case, by setting , Assumption 3B is equivalent to saying that is smooth or is Lipschitz continuous which recovers the corresponding assumption in [3]. However, different from the strongly convex case, here we need the bounded gradient assumption (Assumption 3E). Again, with the help of the above assumptions, we can conclude that bounded above and away from zero as follows.
Lemma 2.
With Lemma 2, the convergence to a neighborhood can also be proven for nonconvex cases.
4 Numerical Experiments
In this section, we present numerical results to illustrate the properties and performance of our proposed algorithms (LBFGSH and LBFGSF) on both convex and nonconvex applications. For comparison, we show performance of popular stochastic gradient algorithms, namely, ADAM [12], ADAGRAD [9] and SGD (momentum SGD). Besides, we include the performance for classical LBFGS where , and a stochastic LBFGS as LBFGSS where we set In the convex setting, we test logistic regression problem on ijcnn1 ^{1}^{1}1Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
. where LBFGSH is identical to LBFGSF because of the linear predictor, so we omit the results for LBFGSF. On the other hand, we show performance of 1hidden layer neural network (with 300 neurons) and LeNet5 (a classical convolutional neural network)
[15] on MNIST. Across all the figures, each epoch refers to a full pass of the dataset, i.e.,
component gradient evaluations.Figure 1 shows suboptimality (training loss for the last column) and test errors of various methods with batch sizes and on the logistic regression problem with ijcnn1 for the first two columns, and LBFGSH exhibits competitive performance with ADAM, SGD and ADAGRAD while LBFGSS seems highly unstable. On the nonconvex examples for the last two columns in the figure, similar results are presented with LBFGSS to be extremely unstable and slow.
To further show the robustness of LBFGSH (LBFGSF), we run each method with different batch sizes and 100 different random seeds on the logistic regression problem with dataset ijcnn1 in Figure 2, and report the results. The dotted lines represent the best and worst performance of the corresponding algorithm and the solid line shows the average performance. Obviously, with large batch sizes, the performance of ADAM, ADAGRAD and SGD worsen while LBFGSH behaves steadily fast and outperforms the others in suboptimality. This also conveys that to achieve the same accuracy, fewer epochs are needed, leading to fewer communications for our framework when the batch size is large.
The ability to use a large batch size is of particular interest in a distributed environment since it allows us to scale to multiple GPUs without reducing the perGPU workload and without sacrificing model accuracy. In order to illustrate the benefit of large batch sizes, we evaluate the stochastic gradient on a neural network with different batch sizes () on a single GPU (Tesla K80), and compare the computational time against that of the pessimistic and utopian cases in Figure 3. Up to , the computational time stays almost constant; nevertheless, with a sufficiently large batch size (), the problem becomes computationally bounded and suffers from the computing resource limited by the single GPU, hence doubling batch size leads to doubling computational time. Therefore, the efficiency of our proposed algorithm shown in Algorithm 2 can benefit tremendously from a distributed environment.
5 Conclusion
We developed a novel framework for the LBFGS method with stochastic batches that is stable and efficient. Based on the framework, we proposed two variants – LBFGSH and LBFGSF, where the latter tries to employ Fisher information matrix instead of the Hessian to approximate the difference of gradients. LBFGSF also admits a distributed implementation. We show that our framework converges linearly to a neighborhood of the optimal solution for convex and nonconvex settings under standard assumptions, and provide numerical experiments on both convex applications and nonconvex neural networks.
Acknowledgments
Jie Liu was partially supported by the IBM PhD Fellowship. Martin Takáč was partially supported by the U.S. National Science Foundation, under award number NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796. We would like to thank Courtney Paquette for her valuable advice on the paper.
References
 [1] ShunIchi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 [2] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality and the smo algorithm. In ICML, 2004.

[3]
Albert S. Berahas, Jorge Nocedal, and Martin Takáč.
A multibatch LBFGS method for machine learning.
In NIPS, pages 1055–1063, 2016.  [4] W. Chen, Z. Wang, and J. Zhou. Largescale LBFGS using mapreduce. In NIPS, pages 1332–1340, 2014.

[5]
David Roxbee Cox.
The regression analysis of binary sequences.
Journal of the Royal Statistical Society, 20(2):215–242, 1958.  [6] YuHong Dai. Convergence properties of the BFGS algoritm. SIAM Journal on Optimization, 13(3):693–701, 2002.
 [7] Aaron Defazio, Francis Bach, and Simon LacosteJulien. SAGA: A fast incremental gradient method with support for nonstrongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
 [8] Aaron Defazio, Justin Domke, and Tibério Caetano. A faster, permutable incremental gradient method for big data problems. In ICML, pages 1125–1133, 2014.
 [9] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 [10] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.

[11]
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.
In NIPS, pages 315–323, 2013.  [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [13] Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. Minibatch semistochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10:242–255, 2016.
 [14] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
 [15] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [16] DongHui Li and Masao Fukushima. A modified BFGS method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics, 129(12):15–35, 2001.
 [17] Julien Mairal. Optimization with firstorder surrogate functions. In ICML, pages 783–791, 2013.
 [18] James Martens. Deep learning via hessianfree optimization. In ICML, pages 735–742, 2010.
 [19] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 [20] James Martens and Roger B. Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. In ICML, pages 2408–2417, 2015.
 [21] Walter F Mascarenhas. The BFGS method with exact line searches fails for nonconvex objective functions. Mathematical Programming, 99(1):49–61, 2004.
 [22] A. Meenakshi and C. Rajian. On a product of positive semidefinite matrices. Linear Algebra and its Applications, 295(1):3–6, 1999.
 [23] Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pages 2613–2621, 2017.
 [24] Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261, 2017.
 [25] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, 2018.
 [26] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
 [27] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. ICLR, 2014.
 [28] Michael JD Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear programming, 9(1):53–72, 1976.
 [29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 [30] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2016.
 [31] Shai ShalevShwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated subgradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
 [32] Shai ShalevShwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
 [33] Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
Appendix A Assumptions, Lemmas and Theorems
Lemma 3 (Lemma 4 in [13]).
Let be vectors in and . Let be a random subset of of size , chosen uniformly at random from all subsets of this cardinality. Taking expectation with respect to , we have
(12) 
Lemma 4 (Lemma 3 in [25] and Equation (10) in [11]).
If s are convex and smooth , then ,
(13) 
where .
Lemma 5.
If is strongly convex with and s are smooth , then , the batch gradient has the following bound,
(14) 
where and . If we further have s convex, the bound shrinks to
(15) 
See 1
See 3
Theorem 4 (A complete version of Theorem 2).
Suppose that Assumptions 2AD hold, and let where is the minimizer of . Let be the iterates generated by the stochastic LBFGS framework with a constant learning rate and with generated by Algorithm 2. Then for all ,
where and . If we further have s convex, then similarly, with a learning rate , we have
Theorem 5.
Suppose that Assumptions 2AD hold, s are convex and let where is the minimizer of . Let be the iterates generated by the stochastic LBFGS framework with , where and satisfies
Then starting from , for all ,
where
(16) 
Appendix B Proofs
b.1 Proof of Lemma 1
Proof.
Instead of analyzing the algorithm in , we study the Hessian approximation where . In this case, the LBFGS are updated as follows (note that the superscript of denotes the iteration of Hessian updates in each iteration).

Set such that
(17) 
For set and compute

Set
Note that following the above updates, the curvature pairs are
It is also easy to know that for LBFGSH,
Comments
There are no comments yet.