# A Multi-Batch L-BFGS Method for Machine Learning

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.


## 1 Introduction

It is common in machine learning to encounter optimization problems involving millions of parameters and very large datasets. To deal with the computational demands imposed by such applications, high performance implementations of stochastic gradient and batch quasi-Newton methods have been developed [1, 11, 9]. In this paper we study a batch approach based on the L-BFGS method [20] that strives to reach the right balance between efficient learning and productive parallelism.

In supervised learning, one seeks to minimize empirical risk,

$$F(w) := \frac{1}{n}\sum_{i=1}^{n} f(w; x_i, y_i) \;\overset{\text{def}}{=}\; \frac{1}{n}\sum_{i=1}^{n} f_i(w),$$

where $(x_i, y_i)$, for $i = 1, \dots, n$, denote the training examples and $f(\,\cdot\,; x_i, y_i)$ is the composition of a prediction function (parametrized by $w$) and a loss function. The training problem consists of finding an optimal choice of the parameters $w$ with respect to $F$, i.e.,

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w). \tag{1.1}$$

At present, the preferred optimization method is the stochastic gradient descent (SGD) method [23, 5], and its variants [14, 24, 12], which are implemented either in an asynchronous manner (e.g. when using a parameter server in a distributed setting) or following a synchronous mini-batch approach that exploits parallelism in the gradient evaluation [2, 22, 13]. A drawback of the asynchronous approach is that it cannot use large batches, as this would cause updates to become too dense and compromise the stability and scalability of the method [16, 22]. As a result, the algorithm spends more time in communication as compared to computation. On the other hand, using a synchronous mini-batch approach one can achieve a near-linear decrease in the number of SGD iterations as the mini-batch size is increased, up to a certain point after which the increase in computation is not offset by the faster convergence [26].

An alternative to SGD is a batch method, such as L-BFGS, which is able to reach high training accuracy and allows one to perform more computation per node, so as to achieve a better balance with communication costs [27]. Batch methods are, however, not as efficient learning algorithms as SGD in a sequential setting [6]. To benefit from the strengths of both methods some high performance systems employ SGD at the start and later switch to a batch method [1].

Multi-Batch Method. In this paper, we follow a different approach consisting of a single method that selects a sizeable subset (batch) of the training data to compute a step, and changes this batch at each iteration to improve the learning abilities of the method. We call this a multi-batch approach to differentiate it from the mini-batch approach used in conjunction with SGD, which employs a very small subset of the training data. When using large batches it is natural to employ a quasi-Newton method, as incorporating second-order information imposes little computational overhead and improves the stability and speed of the method. We focus here on the L-BFGS method, which employs gradient information to update an estimate of the Hessian and computes a step in $\mathcal{O}(d)$ flops, where $d$ is the number of variables. The multi-batch approach can, however, cause difficulties for L-BFGS because this method employs gradient differences to update Hessian approximations. When the gradients used in these differences are based on different data points, the updating procedure can be unstable. Similar difficulties arise in a parallel implementation of the standard L-BFGS method, if some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time, as this again amounts to using different data points to evaluate the function and gradient at the beginning and the end of the iteration. The goal of this paper is to show that stable quasi-Newton updating can be achieved in both settings without incurring extra computational cost or special synchronization. The key is to perform quasi-Newton updating based on the overlap between consecutive batches. The only restriction is that this overlap should not be too small, something that can be achieved in most situations.

Contributions. We describe a novel implementation of the batch L-BFGS method that is robust in the absence of sample consistency; i.e., when different samples are used to evaluate the objective function and its gradient at consecutive iterations. The numerical experiments show that the method proposed in this paper — which we call the multi-batch L-BFGS method — achieves a good balance between computation and communication costs. We also analyze the convergence properties of the new method (using a fixed step length strategy) on both convex and nonconvex problems.

## 2 The Multi-Batch Quasi-Newton Method

In a pure batch approach, one applies a gradient based method, such as L-BFGS [20], to the deterministic optimization problem (1.1). When the number $n$ of training examples is large, it is natural to parallelize the evaluation of $F$ and $\nabla F$ by assigning the computation of the component functions $f_i$ to different processors. If this is done on a distributed platform, it is possible for some of the computational nodes to be slower than the rest. In this case, the contribution of the slow (or unresponsive) computational nodes could be ignored given the stochastic nature of the objective function. This leads, however, to an inconsistency in the objective function and gradient at the beginning and at the end of the iteration, which can be detrimental to quasi-Newton methods. Thus, we seek to find a fault-tolerant variant of the batch L-BFGS method that is capable of dealing with slow or unresponsive computational nodes.

A similar challenge arises in a multi-batch implementation of the L-BFGS method in which the entire training set is not employed at every iteration, but rather, a subset of the data is used to compute the gradient. Specifically, we consider a method in which the dataset is randomly divided into a number of batches — say 10, 50, or 100 — and the minimization is performed with respect to a different batch at every iteration. At the $k$-th iteration, the algorithm chooses a batch $S_k \subset \{1, \dots, n\}$, computes

$$F_{S_k}(w_k) = \frac{1}{|S_k|}\sum_{i \in S_k} f_i(w_k), \qquad \nabla F_{S_k}(w_k) = g_k^{S_k} = \frac{1}{|S_k|}\sum_{i \in S_k} \nabla f_i(w_k), \tag{2.2}$$

and takes a step along the direction $-H_k g_k^{S_k}$, where $H_k$ is an approximation to $\nabla^2 F(w_k)^{-1}$. Allowing the sample $S_k$ to change freely at every iteration gives this approach flexibility of implementation and is beneficial to the learning process, as we show in Section 4. (We refer to $S_k$ as the sample of training points, even though $S_k$ only indexes those points.)
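As a concrete instance of (2.2), the sampled function and gradient for binary logistic regression can be computed as in the following sketch (a minimal NumPy routine; the dataset `X`, labels `y`, and index set `S` are hypothetical, and the paper does not prescribe this particular loss code):

```python
import numpy as np

def batch_loss_and_grad(w, X, y, S):
    """Evaluate F_S(w) and g_S(w) = grad F_S(w) as in (2.2) for
    logistic regression with labels y_i in {-1, +1}.
    S is an array of indices into the dataset (the batch S_k)."""
    Xs, ys = X[S], y[S]                      # rows belonging to the batch
    z = ys * (Xs @ w)                        # margins y_i * x_i^T w
    F_S = np.mean(np.log1p(np.exp(-z)))      # average logistic loss over S
    # d/dw of log(1 + exp(-z_i)) = -y_i * x_i / (1 + exp(z_i))
    g_S = -(Xs.T @ (ys / (1.0 + np.exp(z)))) / len(S)
    return F_S, g_S
```

Any other smooth $f_i$ would fit the same interface; only the averages over the index set $S_k$ matter for the method.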

The case of unresponsive computational nodes and the multi-batch method are similar. The main difference is that node failures create unpredictable changes to the samples $S_k$, whereas a multi-batch method has control over sample generation. In either case, the algorithm employs a stochastic approximation to the gradient and can no longer be considered deterministic. We must, however, distinguish our setting from that of the classical SGD method, which employs small mini-batches and noisy gradient approximations. Our algorithm operates with much larger batches so that distributing the function evaluation is beneficial and the compute time of $g_k^{S_k}$ is not overwhelmed by communication costs. This gives rise to gradients with relatively small variance and justifies the use of a second-order method such as L-BFGS.

Robust Quasi-Newton Updating. The difficulties created by the use of a different sample at each iteration can be circumvented if consecutive samples $S_k$ and $S_{k+1}$ overlap, so that $O_k = S_k \cap S_{k+1} \neq \emptyset$. One can then perform stable quasi-Newton updating by computing gradient differences based on this overlap, i.e., by defining

$$y_{k+1} = g_{k+1}^{O_k} - g_k^{O_k}, \qquad s_{k+1} = w_{k+1} - w_k, \tag{2.3}$$

in the notation given in (2.2). The correction pair $(y_{k+1}, s_{k+1})$ can then be used in the BFGS update. When the overlap set $O_k$ is not too small, $y_{k+1}$ is a useful approximation of the curvature of the objective function along the most recent displacement, and will lead to a productive quasi-Newton step. This observation is based on an important property of Newton-like methods, namely that there is much more freedom in choosing a Hessian approximation than in computing the gradient [7, 3]. Thus, a smaller sample $O_k$ can be employed for updating the inverse Hessian approximation than for computing the batch gradient $g_k^{S_k}$ used in the search direction. In summary, by ensuring that unresponsive nodes do not constitute the vast majority of all working nodes in a fault-tolerant parallel implementation, or by exerting a small degree of control over the creation of the samples $S_k$ in the multi-batch method, one can design a robust method that naturally builds upon the fundamental properties of BFGS updating.
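The update (2.3) can be sketched as follows (a hypothetical helper; `grad(w, idx)` stands in for whatever routine evaluates a sampled gradient on the index set `idx`):

```python
import numpy as np

def overlap_curvature_pair(grad, w_k, w_k1, O_k):
    """Form (y_{k+1}, s_{k+1}) as in (2.3): both gradients are evaluated on
    the SAME overlap set O_k = S_k intersect S_{k+1}, at the old and new
    iterates, so the difference reflects curvature rather than sample noise."""
    s = w_k1 - w_k
    y = grad(w_k1, O_k) - grad(w_k, O_k)   # consistent sample => stable update
    return y, s
```

The point of the construction is that, for a convex sampled objective, the pair satisfies $y^T s > 0$, which the BFGS update requires.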

We should mention in passing that a commonly used strategy for ensuring stability of quasi-Newton updating in machine learning is to enforce gradient consistency [25], i.e., to use the same sample to compute gradient evaluations at the beginning and the end of the iteration. Another popular remedy is to use the same batch for multiple iterations [19], alleviating the gradient inconsistency problem at the price of slower convergence. In this paper, we assume that achieving such sample consistency is not possible (in the fault-tolerant case) or desirable (in a multi-batch framework), and wish to design a new variant of L-BFGS that imposes minimal restrictions on the sample changes.

### 2.1 Specification of the Method

At the $k$-th iteration, the multi-batch BFGS algorithm chooses a set $S_k \subset \{1, \dots, n\}$ and computes a new iterate

$$w_{k+1} = w_k - \alpha_k H_k g_k^{S_k}, \tag{2.4}$$

where $\alpha_k$ is the step length, $g_k^{S_k}$ is the batch gradient (2.2) and $H_k$ is the inverse BFGS Hessian matrix approximation that is updated at every iteration by means of the formula

$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, \qquad V_k = I - \rho_k y_k s_k^T.$$

To compute the correction vectors $(y_{k+1}, s_{k+1})$, we determine the overlap set $O_k = S_k \cap S_{k+1}$ consisting of the samples that are common at the $k$-th and $(k+1)$-st iterations. We define

$$F_{O_k}(w_k) = \frac{1}{|O_k|}\sum_{i \in O_k} f_i(w_k), \qquad \nabla F_{O_k}(w_k) = g_k^{O_k} = \frac{1}{|O_k|}\sum_{i \in O_k} \nabla f_i(w_k),$$

and compute the correction vectors as in (2.3). In this paper we assume that the step length $\alpha_k = \alpha$ is constant.

In the limited memory version, the matrix $H_k$ is defined at each iteration as the result of applying $m$ BFGS updates to a multiple of the identity matrix, using a set of $m$ correction pairs kept in storage. The memory parameter $m$ is typically in the range 2 to 20. When computing the matrix-vector product in (2.4) it is not necessary to form the matrix $H_k$, since one can obtain this product via the two-loop recursion [20], using the $m$ most recent correction pairs $\{s_i, y_i\}$. After the step has been computed, the oldest pair is discarded and the new curvature pair is stored.
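As an illustration, the two-loop recursion can be sketched as follows (a minimal NumPy version with the standard initial scaling $\frac{s^T y}{y^T y} I$; not the paper's distributed, vector-free implementation):

```python
import numpy as np

def two_loop(g, pairs):
    """Return H*g via the L-BFGS two-loop recursion, where `pairs` is a
    list of the m most recent (s_i, y_i) correction pairs, oldest first."""
    q = np.asarray(g, dtype=float).copy()
    alphas = []
    for s, y in reversed(pairs):           # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((a, rho, s, y))
    if pairs:                              # initial scaling gamma * I
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)
    for a, rho, s, y in reversed(alphas):  # second loop: oldest to newest
        b = rho * (y @ q)
        q += (a - b) * s
    return q
```

By construction the implicit matrix satisfies the secant condition $H y_m = s_m$ for the most recent pair, which is a convenient sanity check.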

A pseudo-code of the proposed method is given below, and depends on several parameters. The parameter $r$ denotes the fraction of samples in the dataset used to define the gradient, i.e., $r = |S|/n$. The parameter $o$ denotes the length of the overlap between consecutive samples, and is defined as a fraction of the number of samples in a given batch $S$, i.e., $o = |O|/|S|$.

### 2.2 Sample Generation

We now discuss how the sample $S_k$ is created at each iteration (Line 8 in Algorithm 1).

Distributed Computing with Faults. Consider a distributed implementation in which slave nodes read the current iterate from the master node, compute a local gradient on a subset of the dataset, and send it back to the master node for aggregation in the calculation (2.2). Given a time (computational) budget, it is possible for some nodes to fail to return a result. The schematic in Figure 1a illustrates the gradient calculation across two iterations, $k$ and $k+1$, in the presence of faults. Here $B_j$ denote the batches of data that each slave node receives (their union being the whole dataset), and $g_k^{S_k}$ is the gradient calculation using all nodes that responded within the preallocated time.

Let $J_k$ and $J_{k+1}$ be the sets of indices of all nodes that returned a gradient at the $k$-th and $(k+1)$-st iterations, respectively. Using this notation, $S_k = \cup_{j \in J_k} B_j$ and $S_{k+1} = \cup_{j \in J_{k+1}} B_j$, and we define $O_k = S_k \cap S_{k+1}$. The simplest implementation in this setting preallocates the data on each compute node, requiring minimal data communication, i.e., only one data transfer. In this case the samples $S_k$ will be independent if node failures occur randomly. On the other hand, if the same set of nodes fail, then sample creation will be biased, which is harmful both in theory and practice. One way to ensure independent sampling is to shuffle and redistribute the data to all nodes after a certain number of iterations.

Multi-batch Sampling. We propose two strategies for the multi-batch setting.

Figure 1b illustrates the sample creation process in the first strategy. The dataset is shuffled and batches are generated by collecting subsets of the training set, in order. Every set $S_k$ (except $S_0$) is of the form $S_k = O_{k-1} \cup N_k \cup O_k$, where $O_{k-1}$ and $O_k$ are the overlapping samples with batches $S_{k-1}$ and $S_{k+1}$ respectively, and $N_k$ are the samples that are unique to batch $S_k$. After each pass through the dataset, the samples are reshuffled, and the procedure described above is repeated. In our implementation samples are drawn without replacement, guaranteeing that after every pass (epoch) all samples are used. This strategy has the advantage that it requires no extra computation in the evaluation of $g_k^{O_k}$ and $g_{k+1}^{O_k}$, but the samples $\{S_k\}$ are not independent.

The second sampling strategy is simpler and requires less control. At every iteration $k$, a batch $S_k$ is created by randomly selecting $|S|$ elements from $\{1, \dots, n\}$. The overlapping set $O_k$ is then formed by randomly selecting $|O|$ elements from $S_k$ (subsampling). This strategy is slightly more expensive since $g_{k+1}^{O_k}$ requires extra computation, but if the overlap is small this cost is not significant.
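Both strategies can be sketched as follows (hypothetical helpers; `bs` and `ov` stand for the batch size and overlap size, which the algorithm derives from the batch and overlap fractions):

```python
import numpy as np

def overlapping_batches(n, bs, ov, rng):
    """Strategy 1: shuffle once per epoch and slide a window so that
    consecutive batches share exactly `ov` indices (no extra gradient
    evaluation is needed, but batches are not independent)."""
    perm = rng.permutation(n)
    step = bs - ov                      # advance so `ov` indices are reused
    for start in range(0, n - bs + 1, step):
        yield perm[start:start + bs]

def random_batch_with_overlap(n, bs, ov, rng):
    """Strategy 2: draw S_k uniformly without replacement, then subsample
    the overlap O_k from it (independent batches, one extra gradient)."""
    S = rng.choice(n, size=bs, replace=False)
    O = rng.choice(S, size=ov, replace=False)
    return S, O
```

In the first helper, the curvature pair can be formed from the shared window at no extra cost; in the second, the gradient on `O` at the next iterate must be computed separately.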

## 3 Convergence Analysis

In this section, we analyze the convergence properties of the multi-batch L-BFGS method (Algorithm 1) when applied to the minimization of strongly convex and nonconvex objective functions, using a fixed step length strategy. We assume that the goal is to minimize the empirical risk given in (1.1), but note that a similar analysis could be used to study the minimization of the expected risk.

### 3.1 Strongly Convex case

Due to the stochastic nature of the multi-batch approach, every iteration of Algorithm 1 employs a gradient that contains errors that do not converge to zero. Therefore, by using a fixed step length strategy one cannot establish convergence to the optimal solution $w^\star$, but only convergence to a neighborhood of $w^\star$ [18]. Nevertheless, this result is of interest as it reflects the common practice of using a fixed step length and decreasing it only if the desired testing error has not been achieved. It also illustrates the tradeoffs that arise between the size of the batch and the step length.

In our analysis, we make the following assumptions about the objective function and the algorithm.

Assumptions A.

1. $F$ is twice continuously differentiable.

2. There exist positive constants $\hat{\lambda}$ and $\hat{\Lambda}$ such that $\hat{\lambda} I \preceq \nabla^2 F_O(w) \preceq \hat{\Lambda} I$ for all $w \in \mathbb{R}^d$ and all sets $O \subset \{1, \dots, n\}$.

3. There is a constant $\gamma$ such that $\mathbb{E}_S\big[\|\nabla F_S(w)\|^2\big] \le \gamma^2$ for all $w \in \mathbb{R}^d$ and all sets $S \subset \{1, \dots, n\}$.

4. The samples $S$ are drawn independently and $\nabla F_S(w)$ is an unbiased estimator of the true gradient for all $w \in \mathbb{R}^d$, i.e., $\mathbb{E}[\nabla F_S(w)] = \nabla F(w)$.

Note that Assumption A.2 implies that the entire Hessian $\nabla^2 F(w)$ also satisfies

$$\lambda I \preceq \nabla^2 F(w) \preceq \Lambda I, \qquad \forall w \in \mathbb{R}^d,$$

for some positive constants $\lambda \le \Lambda$. Assuming that every sub-sampled function $F_O$ is strongly convex is not unreasonable, as a regularization term is commonly added in practice when that is not the case.

We begin by showing that the inverse Hessian approximations $H_k$ generated by the multi-batch L-BFGS method have eigenvalues that are uniformly bounded above and away from zero. The proof technique used is an adaptation of that in [8].

###### Lemma 3.1.

If Assumptions A.1-A.2 above hold, there exist constants $0 < \mu_1 \le \mu_2$ such that the Hessian approximations $\{H_k\}$ generated by Algorithm 1 satisfy

$$\mu_1 I \preceq H_k \preceq \mu_2 I, \qquad \text{for } k = 0, 1, 2, \dots$$

Utilizing Lemma 3.1, we show that the multi-batch L-BFGS method with a constant step length converges to a neighborhood of the optimal solution.

###### Theorem 3.2.

Suppose that Assumptions A.1-A.4 hold and let $F^\star = F(w^\star)$, where $w^\star$ is the minimizer of $F$. Let $\{w_k\}$ be the iterates generated by Algorithm 1 with a constant step length $\alpha_k = \alpha \le \frac{1}{2\mu_1\lambda}$, starting from $w_0$. Then for all $k$,

$$\mathbb{E}\big[F(w_k) - F^\star\big] \le \big(1 - 2\alpha\mu_1\lambda\big)^k \big[F(w_0) - F^\star\big] + \Big[1 - \big(1 - 2\alpha\mu_1\lambda\big)^k\Big] \frac{\alpha \mu_2^2 \gamma^2 \Lambda}{4\mu_1\lambda} \;\xrightarrow{\;k \to \infty\;}\; \frac{\alpha \mu_2^2 \gamma^2 \Lambda}{4\mu_1\lambda}.$$

The bound provided by this theorem has two components: (i) a term decaying linearly to zero, and (ii) a term identifying the neighborhood of convergence. Note that a larger step length yields a more favorable constant in the linearly decaying term, at the cost of an increase in the size of the neighborhood of convergence. We return to these tradeoffs in Section 4, where we also note that larger batches increase the opportunities for parallelism and improve the limiting accuracy in the solution, but slow down the learning abilities of the algorithm.

One can establish convergence of the multi-batch L-BFGS method to the optimal solution by employing a sequence of step lengths that converge to zero according to the schedule proposed by Robbins and Monro [23]. However, that provides only a sublinear rate of convergence, which is of little interest in our context where large batches are employed and some type of linear convergence is expected. In this light, Theorem 3.2 is more relevant to practice.

### 3.2 Nonconvex case

The BFGS method is known to fail on nonconvex problems [17, 10]. Even for L-BFGS, which makes only a finite number of updates at each iteration, one cannot guarantee that the Hessian approximations have eigenvalues that are uniformly bounded above and away from zero. To establish convergence of the BFGS method in the nonconvex case, cautious updating procedures have been proposed [15]. Here we employ a cautious strategy that is well suited to our particular algorithm; we skip the update, i.e., set $H_{k+1} = H_k$, if the curvature condition

$$y_k^T s_k \ge \epsilon \|s_k\|^2 \tag{3.5}$$

is not satisfied, where $\epsilon > 0$ is a predetermined constant. Using this mechanism we show that the eigenvalues of the Hessian matrix approximations generated by the multi-batch L-BFGS method are bounded above and away from zero (Lemma 3.3). The analysis presented in this section is based on the following assumptions.
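In an implementation, the cautious strategy reduces to a guard before a new correction pair is stored; a minimal sketch (the pair buffer and helper name are ours, not from the paper):

```python
import numpy as np

def maybe_store_pair(pairs, s, y, eps, m):
    """Store (s, y) only if the curvature condition (3.5),
    y^T s >= eps * ||s||^2, holds; otherwise skip the update,
    which is equivalent to keeping H_{k+1} = H_k.
    Keeps at most m pairs, discarding the oldest."""
    if y @ s >= eps * (s @ s):
        pairs.append((s, y))
        if len(pairs) > m:
            pairs.pop(0)
        return True
    return False
```

Skipping rather than damping the pair keeps the stored history positive-curvature only, which is what the bounds in Lemma 3.3 rely on.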

Assumptions B.

1. $F$ is twice continuously differentiable.

2. The gradients of $F$ are $\Lambda$-Lipschitz continuous, and the gradients of $F_S$ are $\Lambda_S$-Lipschitz continuous for all $w \in \mathbb{R}^d$ and all sets $S \subset \{1, \dots, n\}$.

3. The function $F(w)$ is bounded below by a scalar $\hat{F}$.

4. There exist constants $\gamma \ge 0$ and $\eta > 0$ such that $\mathbb{E}_S\big[\|\nabla F_S(w)\|^2\big] \le \gamma^2 + \eta \|\nabla F(w)\|^2$ for all $w \in \mathbb{R}^d$ and all sets $S \subset \{1, \dots, n\}$.

5. The samples $S$ are drawn independently and $\nabla F_S(w)$ is an unbiased estimator of the true gradient for all $w \in \mathbb{R}^d$, i.e., $\mathbb{E}[\nabla F_S(w)] = \nabla F(w)$.

###### Lemma 3.3.

Suppose that Assumptions B.1-B.2 hold and let $\epsilon > 0$ be given. Let $\{H_k\}$ be the Hessian approximations generated by Algorithm 1, with the modification that $H_{k+1} = H_k$ whenever (3.5) is not satisfied. Then, there exist constants $0 < \mu_1 \le \mu_2$ such that

$$\mu_1 I \preceq H_k \preceq \mu_2 I, \qquad \text{for } k = 0, 1, 2, \dots$$

We can now follow the analysis in [4, Chapter 4] to establish the following result about the behavior of the gradient norm for the multi-batch L-BFGS method with a cautious update strategy.

###### Theorem 3.4.

Suppose that Assumptions B.1-B.5 above hold, and let $\epsilon > 0$ be given. Let $\{w_k\}$ be the iterates generated by Algorithm 1 with a constant step length $\alpha_k = \alpha$, starting from $w_0$, and with the modification that $H_{k+1} = H_k$ whenever (3.5) is not satisfied. Then,

$$\mathbb{E}\left[\frac{1}{L}\sum_{k=0}^{L-1} \|\nabla F(w_k)\|^2\right] \le \frac{\alpha \mu_2^2 \gamma^2 \Lambda}{\mu_1} + \frac{2\big[F(w_0) - \hat{F}\big]}{\alpha \mu_1 L} \;\xrightarrow{\;L \to \infty\;}\; \frac{\alpha \mu_2^2 \gamma^2 \Lambda}{\mu_1}.$$

This result bounds the average norm of the gradient of $F$ after the first $L$ iterations, and shows that the iterates spend increasingly more time in regions where the objective function has a small gradient.

## 4 Numerical Results

In this section, we present numerical results that evaluate the proposed robust multi-batch L-BFGS scheme (Algorithm 1) on logistic regression problems. Figure 2 shows the performance on the webspam dataset, where we compare it against three methods: (i) multi-batch L-BFGS without enforcing sample consistency (L-BFGS), where gradient differences are computed using different samples, i.e., $y_k = g_{k+1}^{S_{k+1}} - g_k^{S_k}$; (ii) multi-batch gradient descent (Gradient Descent), which is obtained by setting $H_k = I$ in Algorithm 1; and, (iii) serial SGD, where at every iteration one sample is used to compute the gradient. We run each method with 10 different random seeds and, where applicable, report results for different batch ($r$) and overlap ($o$) sizes. The proposed method is more stable than the standard L-BFGS method; this is especially noticeable when $r$ is small. On the other hand, serial SGD achieves similar accuracy to the robust L-BFGS method, and at a similar rate, but at the cost of $n$ communications per epoch versus $1/r$ communications per epoch for the multi-batch method. Figure 2 also indicates that the robust L-BFGS method is not too sensitive to the size of the overlap. Similar behavior was observed on other datasets, in regimes where the overlap was not too small; see Appendix B.1. We mention in passing that the L-BFGS step was computed using the vector-free implementation proposed in [9].

We also explore the performance of the robust multi-batch L-BFGS method in the presence of node failures (faults), and compare it to the multi-batch variant that does not enforce sample consistency (L-BFGS). Figure 3 illustrates the performance of the methods on the webspam dataset, for various probabilities of node failure $p$, and suggests that the robust L-BFGS variant is more stable; see Appendix B.2 for further results.

Lastly, we study the strong and weak scaling properties of the robust L-BFGS method on artificial data (Figure 4). We measure the time needed to compute a gradient (Gradient) and the associated communication (Gradient+C), as well as the time needed to compute the L-BFGS direction (L-BFGS) and the associated communication (L-BFGS+C), for various batch sizes ($r$). The figure on the left shows strong scaling of multi-batch L-BFGS on a problem with fixed dimension and number of samples. The size of the input data is 24GB, and we vary the number of MPI processes, $K$. The time it takes to compute the gradient decreases with $K$; however, for small values of $r$, the communication time exceeds the compute time. The figure on the right shows weak scaling on a problem of similar size, but with varying sparsity. Each sample has a fixed number of non-zero elements, thus for any $K$ the size of the local problem is roughly constant (for a total data size of 192GB). We observe almost constant time for the gradient computation, while the cost of computing the L-BFGS direction decreases with $K$; however, if communication is considered, the overall time needed to compute the L-BFGS direction increases slightly. For more details see Appendix C.

## 5 Conclusion

This paper describes a novel variant of the L-BFGS method that is robust and efficient in two settings. The first occurs in the presence of node failures in a distributed computing implementation; the second arises when one wishes to employ a different batch at each iteration in order to accelerate learning. The proposed method avoids the pitfalls of using inconsistent gradient differences by performing quasi-Newton updating based on the overlap between consecutive samples. Numerical results show that the method is efficient in practice, and a convergence analysis illustrates its theoretical properties.

#### Acknowledgements

The first two authors were supported by the Office of Naval Research award N000141410313, the Department of Energy grant DE-FG02-87ER25047 and the National Science Foundation grant DMS-1620022. Martin Takáč was supported by National Science Foundation grant CCF-1618717.

## References

• [1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
• [2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice hall Englewood Cliffs, NJ, 1989.
• [3] R. Bollapragada, R. Byrd, and J. Nocedal. Exact and inexact subsampled newton methods for optimization. arXiv preprint arXiv:1609.08502, 2016.
• [4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
• [5] L. Bottou and Y. LeCun. Large scale online learning. In NIPS, pages 217–224, 2004.
• [6] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.
• [7] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
• [8] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
• [9] W. Chen, Z. Wang, and J. Zhou. Large-scale L-BFGS using MapReduce. In NIPS, pages 1332–1340, 2014.
• [10] Y.-H. Dai. Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 13(3):693–701, 2002.
• [11] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, pages 1223–1231, 2012.
• [12] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
• [13] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
• [14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
• [15] D.-H. Li and M. Fukushima. On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 11(4):1054–1064, 2001.
• [16] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
• [17] W. F. Mascarenhas. The BFGS method with exact line searches fails for non-convex objective functions. Mathematical Programming, 99(1):49–61, 2004.
• [18] A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223–264. Springer, 2001.
• [19] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng. On optimization methods for deep learning. In ICML, pages 265–272, 2011.
• [20] J. Nocedal and S. Wright. Numerical Optimization. Springer New York, 2 edition, 1999.
• [21] M. J. Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear programming, 9(1):53–72, 1976.
• [22] B. Recht, C. Re, S. Wright, and F. Niu. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
• [23] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
• [24] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2016.
• [25] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In AISTATS, pages 436–443, 2007.
• [26] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, pages 1022–1030, 2013.
• [27] Y. Zhang and X. Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In ICML, pages 362–370, 2015.

## Appendix A Proofs and Technical Results

### A.1 Assumptions

We first restate the assumptions that we use in the convergence analysis (Section 3). Assumptions A and B are used in the strongly convex and nonconvex cases, respectively.

##### Assumptions A

1. $F$ is twice continuously differentiable.

2. There exist positive constants $\hat{\lambda}$ and $\hat{\Lambda}$ such that

$$\hat{\lambda} I \preceq \nabla^2 F_O(w) \preceq \hat{\Lambda} I, \tag{A.6}$$

for all $w \in \mathbb{R}^d$ and all sets $O \subset \{1, \dots, n\}$.

3. There is a constant $\gamma$ such that

$$\mathbb{E}_S\big[\|\nabla F_S(w)\|^2\big] \le \gamma^2, \tag{A.7}$$

for all $w \in \mathbb{R}^d$ and all batches $S \subset \{1, \dots, n\}$.

4. The samples $S$ are drawn independently and $\nabla F_S(w)$ is an unbiased estimator of the true gradient for all $w \in \mathbb{R}^d$, i.e.,

$$\mathbb{E}\big[\nabla F_S(w)\big] = \nabla F(w). \tag{A.8}$$

Note that Assumption A.2 implies that the entire Hessian $\nabla^2 F(w)$ also satisfies

$$\lambda I \preceq \nabla^2 F(w) \preceq \Lambda I, \qquad \forall w \in \mathbb{R}^d, \tag{A.9}$$

for some positive constants $\lambda \le \Lambda$.

##### Assumptions B

1. $F$ is twice continuously differentiable.

2. The gradients of $F$ are $\Lambda$-Lipschitz continuous and the gradients of $F_S$ are $\Lambda_S$-Lipschitz continuous for all $w \in \mathbb{R}^d$ and all sets $S \subset \{1, \dots, n\}$.

3. The function $F(w)$ is bounded below by a scalar $\hat{F}$.

4. There exist constants $\gamma \ge 0$ and $\eta > 0$ such that

$$\mathbb{E}_S\big[\|\nabla F_S(w)\|^2\big] \le \gamma^2 + \eta \|\nabla F(w)\|^2, \tag{A.10}$$

for all $w \in \mathbb{R}^d$ and all batches $S \subset \{1, \dots, n\}$.

5. The samples $S$ are drawn independently and $\nabla F_S(w)$ is an unbiased estimator of the true gradient for all $w \in \mathbb{R}^d$, i.e.,

$$\mathbb{E}\big[\nabla F_S(w)\big] = \nabla F(w). \tag{A.11}$$

### A.2 Proof of Lemma 3.1 (Strongly Convex Case)

The following lemma shows that the eigenvalues of the matrices $H_k$ generated by the multi-batch L-BFGS method are bounded above and away from zero when $F$ is strongly convex.

###### Lemma 3.1.

If Assumptions A.1-A.2 above hold, there exist constants $0 < \mu_1 \le \mu_2$ such that the Hessian approximations $\{H_k\}$ generated by the multi-batch L-BFGS method (Algorithm 1) satisfy

$$\mu_1 I \preceq H_k \preceq \mu_2 I, \qquad \text{for } k = 0, 1, 2, \dots$$
###### Proof.

Instead of analyzing the inverse Hessian approximation $H_k$, we study the direct Hessian approximation $B_k = H_k^{-1}$. In this case, the limited memory quasi-Newton updating formula is given as follows:

1. Set and ; where is the memory in L-BFGS.

2. For set and compute

 B(i+1)k=B(i)k−B(i)ksjsTjB(i)ksTjB(i)ksj+yjyTjyTjsj.
3. Set
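As a numerical sanity check, the direct BFGS recursion above can be sketched in a few lines; this is an illustrative toy (the function names `bfgs_direct_update` and `build_B` and the quadratic test problem are ours, not the paper's distributed implementation):

```python
import numpy as np

def bfgs_direct_update(B, s, y):
    """One step of the direct BFGS recursion:
    B_new = B - (B s s^T B) / (s^T B s) + (y y^T) / (y^T s)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def build_B(pairs):
    """Build B_{k+1} from the most recent curvature pairs (s_j, y_j),
    starting from the scaled identity B^(0) = (y^T y / y^T s) I."""
    s_last, y_last = pairs[-1]
    B = (y_last @ y_last) / (y_last @ s_last) * np.eye(len(s_last))
    for s, y in pairs:
        B = bfgs_direct_update(B, s, y)
    return B

# Toy quadratic F(w) = 0.5 w^T A w: exact curvature pairs satisfy y = A s.
A = np.diag([1.0, 2.0, 5.0])
rng = np.random.default_rng(0)
pairs = [(s, A @ s) for s in rng.standard_normal((4, 3))]
B = build_B(pairs)
assert np.allclose(B, B.T) and np.all(np.linalg.eigvalsh(B) > 0)
```

With $y_j^T s_j > 0$ for every pair (guaranteed here since $A$ is positive definite), each update preserves symmetry and positive definiteness, which is what the eigenvalue bounds of the lemma formalize.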

The curvature pairs $y_k$ and $s_k$ are updated via the following formulae:

  $y_{k+1} = g_{k+1}^{O_k} - g_k^{O_k}, \qquad s_k = w_{k+1} - w_k$. (A.12)

A consequence of Assumption A.2 is that the eigenvalues of any sub-sampled Hessian $\nabla^2 F_{\mathcal{O}}(w)$ are bounded above and away from zero. Utilizing this fact, the convexity of the component functions and the definitions (A.12), we have

  $y_k^T s_k \ge \frac{1}{\hat{\Lambda}} \|y_k\|^2 \quad \Longrightarrow \quad \frac{\|y_k\|^2}{y_k^T s_k} \le \hat{\Lambda}$. (A.13)

On the other hand, strong convexity of the sub-sampled functions, a consequence of Assumption A.2, together with the definitions (A.12), provides a lower bound,

  $y_k^T s_k \le \frac{1}{\hat{\lambda}} \|y_k\|^2 \quad \Longrightarrow \quad \frac{\|y_k\|^2}{y_k^T s_k} \ge \hat{\lambda}$. (A.14)

Combining the upper and lower bounds (A.13) and (A.14),

  $\hat{\lambda} \le \frac{\|y_k\|^2}{y_k^T s_k} \le \hat{\Lambda}$. (A.15)
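This two-sided bound can be checked numerically on a toy quadratic, where the curvature pairs are exact: the ratio $\|y\|^2/(y^T s)$ is a Hessian-weighted mean of the Hessian eigenvalues and so lies between the smallest and largest of them. The helper name `curvature_ratio` and the test matrix are ours, for illustration only:

```python
import numpy as np

def curvature_ratio(A, s):
    """||y||^2 / (y^T s) for the exact curvature pair y = A s
    of the quadratic F(w) = 0.5 w^T A w."""
    y = A @ s
    return (y @ y) / (y @ s)

# Hessian eigenvalues lie in [1, 5], so the bound predicts the ratio does too.
A = np.diag([1.0, 3.0, 5.0])
rng = np.random.default_rng(1)
assert all(1.0 <= curvature_ratio(A, s) <= 5.0
           for s in rng.standard_normal((100, 3)))
```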

The above proves that the eigenvalues of the matrices $B_k^{(0)}$ at the start of the L-BFGS update cycles are bounded above and away from zero, for all $k$. We now use a trace-determinant argument to show that the eigenvalues of $B_k$ are bounded above and away from zero.

Let $\mathrm{Tr}(B)$ and $\det(B)$ denote the trace and determinant of a matrix $B$, respectively, and set $j_i = k - \tilde{m} + i$. The trace of the matrix $B_{k+1}$ can be expressed as,

  $\mathrm{Tr}(B_{k+1}) = \mathrm{Tr}(B_k^{(0)}) - \sum_{i=1}^{\tilde{m}} \mathrm{Tr}\!\left(\dfrac{B_k^{(i-1)} s_{j_i} s_{j_i}^T B_k^{(i-1)}}{s_{j_i}^T B_k^{(i-1)} s_{j_i}}\right) + \sum_{i=1}^{\tilde{m}} \mathrm{Tr}\!\left(\dfrac{y_{j_i} y_{j_i}^T}{y_{j_i}^T s_{j_i}}\right) \le \mathrm{Tr}(B_k^{(0)}) + \sum_{i=1}^{\tilde{m}} \dfrac{\|y_{j_i}\|^2}{y_{j_i}^T s_{j_i}} \le \mathrm{Tr}(B_k^{(0)}) + \tilde{m}\hat{\Lambda} \le C_1$, (A.16)

for some positive constant $C_1$, where the inequalities above are due to (A.15) and the fact that the eigenvalues of the initial L-BFGS matrix $B_k^{(0)}$ are bounded above and away from zero.

Using a result due to Powell [21], the determinant of the matrix $B_{k+1}$ generated by the multi-batch L-BFGS method can be expressed as,

  $\det(B_{k+1}) = \det(B_k^{(0)}) \prod_{i=1}^{\tilde{m}} \dfrac{y_{j_i}^T s_{j_i}}{s_{j_i}^T B_k^{(i-1)} s_{j_i}} = \det(B_k^{(0)}) \prod_{i=1}^{\tilde{m}} \dfrac{y_{j_i}^T s_{j_i}}{s_{j_i}^T s_{j_i}} \cdot \dfrac{s_{j_i}^T s_{j_i}}{s_{j_i}^T B_k^{(i-1)} s_{j_i}} \ge \det(B_k^{(0)}) \left(\dfrac{\hat{\lambda}}{C_1}\right)^{\tilde{m}} \ge C_2$, (A.17)

for some positive constant $C_2$, where the above inequalities are due to the fact that the largest eigenvalue of $B_k^{(i)}$ is less than $C_1$, and to Assumption A.2.

The trace (A.16) and determinant (A.17) inequalities derived above imply that the largest eigenvalues of all matrices $B_k$ are bounded above, uniformly, and that the smallest eigenvalues of all matrices $B_k$ are bounded away from zero, uniformly. Since $H_k = B_k^{-1}$, the stated bounds on $H_k$ follow. ∎

### a.3 Proof of Theorem 3.2 (Strongly Convex Case)

Utilizing the result from Lemma 3.1, we now prove a linear convergence result to a neighborhood of the optimal solution, for the case where Assumptions A hold.

###### Theorem 3.2.

Suppose that Assumptions A.1-A.4 above hold, and let $F^\star = F(w^\star)$, where $w^\star$ is the minimizer of $F$. Let $\{w_k\}$ be the iterates generated by the multi-batch L-BFGS method (Algorithm 1) with

  $\alpha_k = \alpha \in \left(0, \dfrac{1}{2\mu_1\lambda}\right)$,

starting from $w_0$. Then for all $k \ge 0$,

  $\mathbb{E}\big[F(w_k) - F^\star\big] \le (1 - 2\alpha\mu_1\lambda)^k \big[F(w_0) - F^\star\big] + \big[1 - (1 - 2\alpha\mu_1\lambda)^k\big] \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda} \xrightarrow{k \to \infty} \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}$.
###### Proof.

We have that

  $F(w_{k+1}) = F\big(w_k - \alpha H_k \nabla F_{S_k}(w_k)\big) \le F(w_k) - \alpha \nabla F(w_k)^T H_k \nabla F_{S_k}(w_k) + \dfrac{\Lambda}{2}\big\|\alpha H_k \nabla F_{S_k}(w_k)\big\|^2 \le F(w_k) - \alpha \nabla F(w_k)^T H_k \nabla F_{S_k}(w_k) + \dfrac{\alpha^2\mu_2^2\Lambda}{2}\big\|\nabla F_{S_k}(w_k)\big\|^2$, (A.18)

where the first inequality arises due to (A.9), and the second inequality arises as a consequence of Lemma 3.1.

Taking the expectation (over $S_k$) of equation (A.18),

  $\mathbb{E}_{S_k}\big[F(w_{k+1})\big] \le F(w_k) - \alpha \nabla F(w_k)^T H_k \nabla F(w_k) + \dfrac{\alpha^2\mu_2^2\Lambda}{2}\mathbb{E}_{S_k}\big[\|\nabla F_{S_k}(w_k)\|^2\big] \le F(w_k) - \alpha\mu_1\|\nabla F(w_k)\|^2 + \dfrac{\alpha^2\mu_2^2\gamma^2\Lambda}{2}$, (A.19)

where in the first inequality we make use of Assumption A.4, and the second inequality arises due to Lemma 3.1 and Assumption A.3.

Since $F$ is $\lambda$-strongly convex, we can use the following relationship between the norm of the gradient squared and the distance of the $k$-th iterate from the optimal solution,

  $2\lambda\big[F(w_k) - F^\star\big] \le \|\nabla F(w_k)\|^2$.
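This inequality (a Polyak-Łojasiewicz-type bound implied by $\lambda$-strong convexity) is easy to verify on a quadratic; the check below is an illustration of ours, not part of the proof:

```python
import numpy as np

# F(w) = 0.5 w^T A w has minimizer w* = 0 and F* = 0, and is
# lam-strongly convex with lam = smallest eigenvalue of A. Verify
# 2 * lam * (F(w) - F*) <= ||grad F(w)||^2 at random points.
A = np.diag([1.0, 3.0, 5.0])
lam = 1.0
rng = np.random.default_rng(3)
for w in rng.standard_normal((100, 3)):
    F, g = 0.5 * w @ A @ w, A @ w
    assert 2.0 * lam * F <= g @ g + 1e-12
```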

Using the above with (A.19),

  $\mathbb{E}_{S_k}\big[F(w_{k+1})\big] \le F(w_k) - \alpha\mu_1\|\nabla F(w_k)\|^2 + \dfrac{\alpha^2\mu_2^2\gamma^2\Lambda}{2} \le F(w_k) - 2\alpha\mu_1\lambda\big[F(w_k) - F^\star\big] + \dfrac{\alpha^2\mu_2^2\gamma^2\Lambda}{2}$. (A.20)

Let

  $\phi_k = \mathbb{E}\big[F(w_k) - F^\star\big]$, (A.21)

where the expectation is over all batches and all history starting with $w_0$. Thus (A.20) can be expressed as,

  $\phi_{k+1} \le (1 - 2\alpha\mu_1\lambda)\phi_k + \dfrac{\alpha^2\mu_2^2\gamma^2\Lambda}{2}$, (A.22)

from which we deduce that, in order to reduce the expected function value relative to the previous one, the step length needs to be in the range

  $\alpha \in \left(0, \dfrac{1}{2\mu_1\lambda}\right)$.

Subtracting $\frac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}$ from either side of (A.22) yields

  $\phi_{k+1} - \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda} \le (1 - 2\alpha\mu_1\lambda)\phi_k + \dfrac{\alpha^2\mu_2^2\gamma^2\Lambda}{2} - \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda} = (1 - 2\alpha\mu_1\lambda)\left[\phi_k - \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}\right]$. (A.23)

Recursive application of (A.23) yields

  $\phi_k - \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda} \le (1 - 2\alpha\mu_1\lambda)^k \left[\phi_0 - \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}\right]$,

and thus,

  $\phi_k \le (1 - 2\alpha\mu_1\lambda)^k \phi_0 + \big[1 - (1 - 2\alpha\mu_1\lambda)^k\big]\dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}$. (A.24)
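The contraction toward a fixed neighborhood can be illustrated by iterating the scalar recursion (A.22) directly; the constants below are arbitrary illustrative values of ours satisfying the step-length condition:

```python
# Iterate phi_{k+1} = (1 - 2*alpha*mu1*lam) * phi_k + alpha^2 * mu2^2 * gamma2 * Lam / 2.
# The iterates contract linearly toward the fixed point
# alpha * mu2^2 * gamma2 * Lam / (4 * mu1 * lam), the residual neighborhood in (A.24).
alpha, mu1, mu2, lam, Lam, gamma2 = 0.01, 0.5, 2.0, 1.0, 4.0, 1.0
assert 0.0 < alpha < 1.0 / (2.0 * mu1 * lam)  # step-length range from the theorem

phi = 10.0  # phi_0 = F(w_0) - F*
for _ in range(5000):
    phi = (1.0 - 2.0 * alpha * mu1 * lam) * phi + alpha**2 * mu2**2 * gamma2 * Lam / 2.0
fixed_point = alpha * mu2**2 * gamma2 * Lam / (4.0 * mu1 * lam)
assert abs(phi - fixed_point) < 1e-10
```

Note that the limit shrinks with the step length $\alpha$: a smaller step gives a tighter neighborhood but a slower contraction factor $1 - 2\alpha\mu_1\lambda$.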

Finally, using the definition (A.21) of $\phi_k$ with the above expression yields the desired result,

  $\mathbb{E}\big[F(w_k) - F^\star\big] \le (1 - 2\alpha\mu_1\lambda)^k\big[F(w_0) - F^\star\big] + \big[1 - (1 - 2\alpha\mu_1\lambda)^k\big]\dfrac{\alpha\mu_2^2\gamma^2\Lambda}{4\mu_1\lambda}$. ∎

### a.4 Proof of Lemma 3.3 (Nonconvex Case)

The following lemma shows that the eigenvalues of the matrices $H_k$ generated by the multi-batch L-BFGS method are bounded above and away from zero in the nonconvex case.

###### Lemma 3.3.

Suppose that Assumptions B.1-B.2 hold and let $\epsilon > 0$ be given. Let $\{H_k\}$ be the Hessian approximations generated by the multi-batch L-BFGS method (Algorithm 1), with the modification that the Hessian approximation update is performed only when

  $y_k^T s_k \ge \epsilon \|s_k\|^2$, (A.25)

else $H_{k+1} = H_k$. Then, there exist constants $0 < \mu_1 \le \mu_2$ such that

  $\mu_1 I \preceq H_k \preceq \mu_2 I, \quad \text{for } k = 0, 1, 2, \dots$
###### Proof.

Similar to the proof of Lemma 3.1, we study the direct Hessian approximation $B_k = H_k^{-1}$.

The curvature pairs $y_k$ and $s_k$ are updated via the following formulae:

  $y_{k+1} = g_{k+1}^{O_k} - g_k^{O_k}, \qquad s_k = w_{k+1} - w_k$. (A.26)

The skipping mechanism (A.25) provides both an upper and a lower bound on the quantity $\frac{\|y_k\|^2}{y_k^T s_k}$, which in turn ensures that the initial L-BFGS Hessian approximation is bounded above and away from zero. The lower bound is attained by repeated application of Cauchy's inequality to condition (A.25). We have from (A.25) that

  $\epsilon \|s_k\|^2 \le y_k^T s_k \le \|y_k\| \|s_k\|$,

and therefore

  $\|s_k\| \le \dfrac{1}{\epsilon}\|y_k\|$.

It follows that

  $s_k^T y_k \le \|s_k\|\|y_k\| \le \dfrac{1}{\epsilon}\|y_k\|^2$,

and hence

  $\dfrac{\|y_k\|^2}{s_k^T y_k} \ge \epsilon$. (A.27)

The upper bound is attained via the Lipschitz continuity of the sample gradients, which gives $\|y_k\| \le \Lambda_{O_k}\|s_k\|$, so that

  $y_k^T s_k \ge \epsilon\|s_k\|^2 \ge \dfrac{\epsilon}{\Lambda_{O_k}^2}\|y_k\|^2$.

Re-arranging the above expression yields the desired upper bound,

  $\dfrac{\|y_k\|^2}{s_k^T y_k} \le \dfrac{\Lambda_{O_k}^2}{\epsilon}$. (A.28)

Combining (A.27) and (A.28),

  $\epsilon \le \dfrac{\|y_k\|^2}{y_k^T s_k} \le \dfrac{\Lambda_{O_k}^2}{\epsilon}$.
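The cautious test (A.25) and the resulting two-sided bound can be expressed as a small guard and checked numerically; the helper `accept_pair` and the toy Lipschitz gradient map are ours, for illustration only:

```python
import numpy as np

def accept_pair(s, y, eps):
    """Cautious-update test (A.25): keep the pair only if y^T s >= eps * ||s||^2;
    otherwise the L-BFGS update is skipped and H_{k+1} = H_k."""
    return y @ s >= eps * (s @ s)

# For every accepted pair with ||y|| <= Lam_O * ||s|| (Lipschitz sample
# gradients), the ratio ||y||^2 / (y^T s) must lie in [eps, Lam_O^2 / eps].
eps, Lam_O = 0.25, 5.0
A = np.diag([0.5, 2.0, 5.0])  # toy gradient map y = A s with Lipschitz constant 5
rng = np.random.default_rng(2)
for s in rng.standard_normal((200, 3)):
    y = A @ s
    if accept_pair(s, y, eps):
        r = (y @ y) / (y @ s)
        assert eps <= r <= Lam_O**2 / eps
```

The guard never accepts a pair with $y_k^T s_k \le 0$, which is what rules out indefinite or unbounded Hessian approximations in the nonconvex setting.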

The above proves that the eigenvalues of the matrices $B_k^{(0)}$ at the start of the L-BFGS update cycles are bounded above and away from zero, for all $k$. The rest of the proof follows the same trace-determinant argument as in the proof of Lemma 3.1, the only difference being that the last inequality in (A.16) comes as a result of the cautious update strategy. ∎

### a.5 Proof of Theorem 3.4 (Nonconvex Case)

Utilizing the result from Lemma 3.3, we can now establish the following result about the behavior of the gradient norm for the multi-batch L-BFGS method with a cautious update strategy.

###### Theorem 3.4.

Suppose that Assumptions B.1-B.5 above hold. Let $\{w_k\}$ be the iterates generated by the multi-batch L-BFGS method (Algorithm 1) with

  $\alpha_k = \alpha \in \left(0, \dfrac{\mu_1}{\mu_2^2\eta\Lambda}\right)$,

where $w_0$ is the starting point. Also, suppose that if

  $y_k^T s_k < \epsilon\|s_k\|^2$,

for some $\epsilon > 0$, the update of the inverse L-BFGS Hessian approximation is skipped, $H_{k+1} = H_k$. Then, for all $L \ge 1$,

  $\mathbb{E}\left[\dfrac{1}{L}\sum_{k=0}^{L-1}\|\nabla F(w_k)\|^2\right] \le \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{\mu_1} + \dfrac{2\big[F(w_0) - \hat{F}\big]}{\alpha\mu_1 L} \xrightarrow{L \to \infty} \dfrac{\alpha\mu_2^2\gamma^2\Lambda}{\mu_1}$.
###### Proof.

Starting with (A.18), taking the expectation over $S_k$, and using Lemma 3.3 and Assumption B.5,

  $\mathbb{E}_{S_k}\big[F(w_{k+1})\big] \le F(w_k) - \alpha\mu_1\|\nabla F(w_k)\|^2 + \dfrac{\alpha^2\mu_2^2\Lambda}{2}\,\mathbb{E}_{S_k}\big[\|\nabla F_{S_k}(w_k)\|^2\big]$