We consider the finite-sum minimization problem of the form
In general, the problem of form (1
) covers a wide range of convex and nonconvex problems including logistic regression, multi-kernel learning[2, 33], conditional random fields 10], etc. Classical first-order methods to solve (1) are gradient descent (GD) 
and stochastic gradient descent (SGD)[29, 31]. A large class of optimization methods can be used to solve (1), where the iterative updates can be generalized as follows,
where is some descent direction, is an inverse Hessian approximation of at , and is an estimate of .
is an identity matrix, the update is considered a first-order method. Numerous work has focused on the choice ofsuch as SAG/SAGA [30, 7], MISO/FINITO [17, 8], SDCA , SVRG/S2GD [11, 13], SARAH [23, 24]. Nevertheless, with the importance of second-order optimization providing potential curvature around local optima and thus promoting fast convergence, the choice of non-identity is crucial to the development of modern optimization algorithms.
Within the framework of second-order optimization, a popular choice for is the inverse Hessian; however, we lack an efficient way to invert matrices, leading to increases in computation and communication costs to a problem for the distributed setting. Motivated by this, quasi-Newton methods, among which BFGS is one of the most popular, were developed, including a practical variant named limited-memory BFGS (L-BFGS) . It has been widely known that batch methods have been successfully applied in first-order algorithms and provide effective improvements, but it remains a problem for L-BFGS due to the instability caused by randomness between different gradient evaluations. Therefore, the development of an efficient and stable L-BFGS is necessary.
Our contributions. In this paper, we analyze L-BFGS with stochastic batches for both convex and nonconvex optimization, as well as its distributed implementation. LBFGS-H originates from the idea of L-BFGS and uses Hessian information to approximate the differences of gradients. LBFGS-F combines L-BFGS with Fisher information matrix from the natural gradient algorithm [1, 19, 27]. We show that they are efficient for minimizing finite-sum problems both in theory and in practice. The key contributions of our paper are summarized as follows.
We propose a framework for approximating the differences of gradients in the L-BFGS algorithm that ensures stability for the general finite-sum problems. We show it converges linearly to a neighborhood of the optimal solution for convex and nonconvex finite-sum problems under standard assumptions .
In a distributed environment, we introduce a variant LBFGS-F where the Hessian matrix for approximating gradient differences is replaced by the Fisher information matrix .
With a potential acceleration in practice using ADAM techniques , we verify the competitive performances of both LBFGS-H and LBFGS-F against mainstream optimization methods in both convex and nonconvex applications.
2 The Algorithm
In this section, we propose a new stochastic L-BFGS framework, as well as its distributed implementation with Fisher information matrix. Before proceeding to the new algorithm, let us revisit the procedure for the classical L-BFGS.
2.1 Limited-memory BFGS
The classical L-BFGS algorithm  is presented as below.
In each iteration, first, we estimate the direction by using curvature pairs . Then, the learning rate is chosen such that certain condition (e.g. line search) is satisfied, and we make an update. Last, we evaluate the curvature pairs and replace the pairs stored in the memory while keeping the number of curvature pairs no larger than . The key step in this procedure is the evaluation of the search direction using the curvature pairs, which appears as the well-known two-loop recursion (Algorithm 0(b)). Note that in the classical L-BFGS, the main algorithm usually applies a line-search technique for choosing the learning rate .
The intrinsic idea within L-BFGS is to utilize the curvature information implied by the vector pairs to help regularize the gradient direction. However, within the setting of stochastic batches, the update of , where the batch gradient is defined as
makes it difficult to stabilize the behavior of the algorithm. One of the remedies is to assume that there is an overlap between the samples and , i.e., , and replace the with in . However, this idea requires the batch size to be large enough.
Recall from the Taylor expansion for a multivariate vector-valued function
where is the Jacobian matrix with respect to , and has all elements to be . Hence, we can conclude that: when is close to ,
where is the Hessian at , which is exactly the secant equation in BFGS. Therefore, another possible remedy to stabilize L-BFGS is to approximate the differences of gradient using (approximate) second-order information, i.e.,
2.2 Stochastic L-BFGS with Hessian Information and Vector-free Two-loop Recursion
The proposed algorithm of stochastic L-BFGS with Hessian information (LBFGS-H) is formulated by replacing with the stochastic version of in Algorithm 0(a), i.e.,
where is the stochastic batch picked at iteration and .
For an efficient implementation in a map-reduce environment (e.g. Hadoop, Spark), we use a vector-free L-BFGS (VL-BFGS) update in Algorithm 2 originated from  for the two-loop recursion.  proposes a vector-free L-BFGS based on the classical L-BFGS where they set . However, the choice of is very important, therefore we propose the vector-free L-BFGS algorithm applicable to any feasible as follows.
In details, if we observe the direction generated by the two-loop recursion in Algorithm 0(b), we are able to figure out that we can represent the output direction using the invariable base vectors, i.e.,
The direction after the first loop can be written as , and after we scale the direction with we obtain , so the final result of the two-loop recursion can be written as
Note that the coefficients are evaluated with only dot-products which are defined in the matrix of the following form:
However, in the second loop which contributes to the coefficient evaluations of , from to ,
when we define a vector with the elements .
2.3 Fisher Information Matrix as a Hessian Approximation and Distributed Optimization
When we have no access to the second-order information, instead of utilizing , we are still able to use approximations of . Recently, numerous research has been conducted on the natural gradient algorithm, where in the update (2), the inverse Fisher information matrix serves as [1, 20].
If we further consider the cost function in (1) as , where , is a convex loss and is some network structure, then an element of the Hessian matrix can be written as:
where the first term is the component of the Hessian due to variation in ; since we are only looking at variation in , we do effectively a change of basis using the Jacobian of . The second term, on the other hand, is the component that is due to variation in , which is why we see the Hessian of . As it goes to the neighborhood of the minimum of the cost , the first derivatives are approaching zero, which indicates that the second term is negligible. However, the first term, as an approximation of Hessian, which can be written as the following in the matrix form, is no different but identical to the Generalized Gauss-Newton matrix (GGN) ,
where is the Jacobian matrix of with respect to at , is the Hessian matrix of with respect to at and we use to denote the Hessian or Hessian approximation used to smoothen , i.e., .
It has been verified that in the cases of popular loss functions such as cross-entropy and least-squares, the GGN matrix is exactly the Fisher information matrix (FIM). Note that here can be nonconvex which covers the applications of neural networks. Under the framework of stochastic L-BFGS we propose, we introduce the stochastic L-BFGS with Fisher information (LBFGS-F) by replacing in (3) with the FIM. Note that when the predictor is linear, i.e. , with the loss function as either the cross-entropy or the least-squares, LBFGS-F is identical to LBFGS-H (GGN = FIM = Hessian). Similarly, this also applies to the batch version of the FIM.
As a map-reduce implementation of L-BFGS, the VL-BFGS update is praised for the parallelizable and distributed updates, and the possible communication cost in a distributed environment is in each iteration , where is a small constant among the choices of . The classical L-BFGS needs an update on the gradient to calculate and this can be implemented by calculating local gradients from different workers and then the local gradients being aggregated on the server. We can still apply similar tricks to LBFGS-F but it requires more strict assumptions.
[Diagonal Hessian of ] The Hessian of the loss function with respect to the – , is always diagonal.
This condition is not always true throughout all applications; however, in the cases of least-square loss and cross-entropy loss, where the prior case has with as the stochastic batch, and in the later case , the Hessian is obviously always diagonal.
Consider a specific batch . If we split it into blocks, where the blocks are denoted as , and assume the corresponding Jacobian block matrices as , then the Hessian vector product with any vector can be written as
and since is diagonal, we can write in the form of diagonal blocks with the sizes to be , thus the above is equivalent to
which means that we can evaluate the FIM-vector products with data distributed on different workers and then aggregate them on the server. The communication cost in each round can be .
[Distributed Optimization and Communication Cost] Suppose that Assumption 1 holds. Then Algorithm LBFGS-F can be implemented in a distributed fashion, with a total communication cost of in each round, where is the number of workers.
2.4 Implementation Details
In this part, we cover important techniques for our stochastic LBFGS framework. The initialization and momentum are crucial in accelerating the algorithm. Meanwhile, keeping positive semi-definiteness is significant for finding the correct direction , especially in the nonconvex setting.
Initialization and Momentum
The initialization is crucial in the L-BFGS algorithm. The original L-BFGS proposes to use as where is a constant and a commonly great choice suggested is . However, this may not be the case in the stochastic setting where stochasticity can lead to considerable fluctuations in Hessian scalings over the iterations. Therefore, we consider to use a momentum technique where we combine the past first-order information with the current one. With the recent success of ADAM , the scaling of the ADAM stochastic gradient provides excellent and stable performance. The authors evaluate the momentum stochastic gradient: with
and the momentum of the second moment of stochastic gradientfollowed by a bias correction step, i.e., and , where is an approximation to the diagonal of the Fisher information matrix . Then ADAM makes a step with a direction .
Hence, in our experiments, we estimate with the ADAM preconditioner, i.e., , and apply momentum to update the stochastic gradient with . Note that when the memory in Algorithm 2, our algorithm completely recovers ADAM.
Guarantees of Positive Semi-definiteness
. Even L-BFGS with limited updates over each iteration, cannot guarantee the eigenvalues of approximate Hessian bounded above and away from zero. One has to apply a cautious update where the curvature conditionis satisfied in order to maintain the positive definiteness of Hessian approximations . As a well-suited approach to our algorithm, we employ a cautious strategy : we skip the update, i.e., set , if
is violated, where is a predefined positive constant. With the stated condition guaranteed at each L-BFGS update, the eigenvalues of the Hessian approximations generated by our framework are bounded above and away from zero (Lemma 2).
3 Convergence Analysis
In this section, we study the convergence of our stochastic L-BFGS framework. Due to the stochastic batches of the LBFGS-F and LBFGS-H, by using a fixed learning rate, one cannot establish the convergence to the optimal solution (or first-order stationary point) but only to a neighborhood of it. We provide theoretical foundations for both strongly convex and nonconvex objectives. Throughout the analysis, we will assume that , the function is -Lipschitz continuous or -smooth. i.e.,
The above implies that is also -smooth.
3.1 Strongly Convex Case
Now we are ready to present the theoretical results for strongly convex objectives. Under this circumstance, the global optimal points is unique. Before proceeding, we need to make the following standard assumptions  about the objective and the algorithm.
Assume that the following assumptions hold.
is twice continuously differentiable.
There exist positive constants and such that for all and all batches of size , where refers to the Hessian (approximation) to stabilize .
in Algorithm 2 is symmetric and there exists such that .
The batches are drawn independently and
is an unbiased estimator of the true gradientfor all , i.e., .
We should be aware that Assumption 2B also suggests that there is some such that , i.e., F is strongly convex with and -smooth. Because of -strong convexity, satisfies:
In addition, we should remark here that Assumption 2 is different to the standard assumption in  in the sense that we do not require a bounded stochastic gradient assumption since such assumption is barely correct in both theory and practice . We also remark that Assumption 2 is generalization to the corresponding assumption in  where by setting , we recover the assumption in  so Assumption 2C is not a new assumption. Under the above assumptions, we are able to declare the following lemma that the Hessian approximation formulated by Algorithm 2 are bounded above and away from zero.
3.2 Nonconvex Case
Under the following standard nonconvex assumptions , we can proceed with the convergence for nonconvex problems for the first-order stationary points.
Assume that the following assumptions hold.
is twice continuously differentiable.
There exists a positive constant such that for all batches of size . is -smooth.
in Algorithm 2 is symmetric and there exists such that .
The function is bounded below by a scalar .
There exist constants and such that for all and batches of size .
The batches are drawn independently and is an unbiased estimator of the true gradient for all , i.e., .
Similar as the strongly convex case, by setting , Assumption 3B is equivalent to saying that is -smooth or is -Lipschitz continuous which recovers the corresponding assumption in . However, different from the strongly convex case, here we need the bounded gradient assumption (Assumption 3E). Again, with the help of the above assumptions, we can conclude that bounded above and away from zero as follows.
With Lemma 2, the convergence to a neighborhood can also be proven for nonconvex cases.
4 Numerical Experiments
In this section, we present numerical results to illustrate the properties and performance of our proposed algorithms (LBFGS-H and LBFGS-F) on both convex and nonconvex applications. For comparison, we show performance of popular stochastic gradient algorithms, namely, ADAM , ADAGRAD  and SGD (momentum SGD). Besides, we include the performance for classical L-BFGS where , and a stochastic L-BFGS as LBFGS-S where we set In the convex setting, we test logistic regression problem on ijcnn1 111Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
. where LBFGS-H is identical to LBFGS-F because of the linear predictor, so we omit the results for LBFGS-F. On the other hand, we show performance of 1-hidden layer neural network (with 300 neurons) and LeNet-5 (a classical convolutional neural network) on MNIST
. Across all the figures, each epoch refers to a full pass of the dataset, i.e.,component gradient evaluations.
Figure 1 shows sub-optimality (training loss for the last column) and test errors of various methods with batch sizes and on the logistic regression problem with ijcnn1 for the first two columns, and LBFGS-H exhibits competitive performance with ADAM, SGD and ADAGRAD while LBFGS-S seems highly unstable. On the nonconvex examples for the last two columns in the figure, similar results are presented with LBFGS-S to be extremely unstable and slow.
To further show the robustness of LBFGS-H (LBFGS-F), we run each method with different batch sizes and 100 different random seeds on the logistic regression problem with dataset ijcnn1 in Figure 2, and report the results. The dotted lines represent the best and worst performance of the corresponding algorithm and the solid line shows the average performance. Obviously, with large batch sizes, the performance of ADAM, ADAGRAD and SGD worsen while LBFGS-H behaves steadily fast and outperforms the others in sub-optimality. This also conveys that to achieve the same accuracy, fewer epochs are needed, leading to fewer communications for our framework when the batch size is large.
The ability to use a large batch size is of particular interest in a distributed environment since it allows us to scale to multiple GPUs without reducing the per-GPU workload and without sacrificing model accuracy. In order to illustrate the benefit of large batch sizes, we evaluate the stochastic gradient on a neural network with different batch sizes () on a single GPU (Tesla K80), and compare the computational time against that of the pessimistic and utopian cases in Figure 3. Up to , the computational time stays almost constant; nevertheless, with a sufficiently large batch size (), the problem becomes computationally bounded and suffers from the computing resource limited by the single GPU, hence doubling batch size leads to doubling computational time. Therefore, the efficiency of our proposed algorithm shown in Algorithm 2 can benefit tremendously from a distributed environment.
We developed a novel framework for the L-BFGS method with stochastic batches that is stable and efficient. Based on the framework, we proposed two variants – LBFGS-H and LBFGS-F, where the latter tries to employ Fisher information matrix instead of the Hessian to approximate the difference of gradients. LBFGS-F also admits a distributed implementation. We show that our framework converges linearly to a neighborhood of the optimal solution for convex and nonconvex settings under standard assumptions, and provide numerical experiments on both convex applications and nonconvex neural networks.
Jie Liu was partially supported by the IBM PhD Fellowship. Martin Takáč was partially supported by the U.S. National Science Foundation, under award number NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796. We would like to thank Courtney Paquette for her valuable advice on the paper.
-  Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
-  Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality and the smo algorithm. In ICML, 2004.
Albert S. Berahas, Jorge Nocedal, and Martin Takáč.
A multi-batch L-BFGS method for machine learning.In NIPS, pages 1055–1063, 2016.
-  W. Chen, Z. Wang, and J. Zhou. Large-scale L-BFGS using mapreduce. In NIPS, pages 1332–1340, 2014.
David Roxbee Cox.
The regression analysis of binary sequences.Journal of the Royal Statistical Society, 20(2):215–242, 1958.
-  Yu-Hong Dai. Convergence properties of the BFGS algoritm. SIAM Journal on Optimization, 13(3):693–701, 2002.
-  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
-  Aaron Defazio, Justin Domke, and Tibério Caetano. A faster, permutable incremental gradient method for big data problems. In ICML, pages 1125–1133, 2014.
-  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
-  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.In NIPS, pages 315–323, 2013.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10:242–255, 2016.
-  John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
-  Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
-  Dong-Hui Li and Masao Fukushima. A modified BFGS method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics, 129(1-2):15–35, 2001.
-  Julien Mairal. Optimization with first-order surrogate functions. In ICML, pages 783–791, 2013.
-  James Martens. Deep learning via hessian-free optimization. In ICML, pages 735–742, 2010.
-  James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
-  James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
-  Walter F Mascarenhas. The BFGS method with exact line searches fails for non-convex objective functions. Mathematical Programming, 99(1):49–61, 2004.
-  A. Meenakshi and C. Rajian. On a product of positive semidefinite matrices. Linear Algebra and its Applications, 295(1):3–6, 1999.
-  Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pages 2613–2621, 2017.
-  Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261, 2017.
-  Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, 2018.
-  Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
-  Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. ICLR, 2014.
-  Michael JD Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear programming, 9(1):53–72, 1976.
-  Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
-  Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2016.
-  Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
-  Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
-  Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
Appendix A Assumptions, Lemmas and Theorems
Lemma 3 (Lemma 4 in ).
Let be vectors in and . Let be a random subset of of size , chosen uniformly at random from all subsets of this cardinality. Taking expectation with respect to , we have
If s are convex and -smooth , then ,
If is strongly convex with and s are -smooth , then , the batch gradient has the following bound,
where and . If we further have s convex, the bound shrinks to
Theorem 4 (A complete version of Theorem 2).
Suppose that Assumptions 2A-D hold, and let where is the minimizer of . Let be the iterates generated by the stochastic L-BFGS framework with a constant learning rate and with generated by Algorithm 2. Then for all ,
where and . If we further have s convex, then similarly, with a learning rate , we have
Suppose that Assumptions 2A-D hold, s are convex and let where is the minimizer of . Let be the iterates generated by the stochastic L-BFGS framework with , where and satisfies
Then starting from , for all ,
Appendix B Proofs
b.1 Proof of Lemma 1
Instead of analyzing the algorithm in , we study the Hessian approximation where . In this case, the L-BFGS are updated as follows (note that the superscript of denotes the iteration of Hessian updates in each iteration).
Set such that
For set and compute
Note that following the above updates, the curvature pairs are
It is also easy to know that for LBFGS-H,