1 Introduction
The L-BFGS method (Liu & Nocedal, 1989) has traditionally been regarded as a batch method in the machine learning community. This is because quasi-Newton algorithms need gradients of high quality in order to construct useful quadratic models and perform reliable line searches. These algorithmic ingredients can be implemented, it seems, only by using very large batch sizes, resulting in a costly iteration that makes the overall algorithm slow compared with stochastic gradient methods (Robbins & Monro, 1951).
Even before the resurgence of neural networks, many researchers observed that a well-tuned implementation of the stochastic gradient (SG) method was far more effective on large-scale logistic regression applications than the batch L-BFGS method, even when taking into account the advantages of parallelism offered by the use of large batches. The preeminence of the SG method (and its variants) became more pronounced with the advent of deep neural networks, and some researchers have speculated that SG is endowed with certain regularization properties that are essential in the minimization of such complex nonconvex functions (Hardt et al., 2015; Keskar et al., 2016).
In this paper, we postulate that the most efficient algorithms for machine learning may not reside entirely in the highly stochastic or full batch regimes, but should employ a progressive batching approach in which the sample size is initially small, and is increased as the iteration progresses. This view is consistent with recent numerical experiments on training various deep neural networks (Smith et al., 2017; Goyal et al., 2017), where the SG method, with increasing sample sizes, yields similar test loss and accuracy as the standard (fixed mini-batch) SG method, while offering significantly greater opportunities for parallelism.
Progressive batching algorithms have received much attention recently from a theoretical perspective. It has been shown that they enjoy complexity bounds that rival those of the SG method (Byrd et al., 2012), and that they can achieve a fast rate of convergence (Friedlander & Schmidt, 2012). The main appeal of these methods is that they inherit the efficient initial behavior of the SG method, offer greater opportunities to exploit parallelism, and allow for the incorporation of second-order information. The latter can be done efficiently via quasi-Newton updating.
An integral part of quasi-Newton methods is the line search, which ensures that a convex quadratic model can be constructed at every iteration. One challenge that immediately arises is how to perform this line search when the objective function is stochastic. This is an issue that has not received sufficient attention in the literature, where stochastic line searches have been largely dismissed as inappropriate. In this paper, we take a step towards the development of stochastic line searches for machine learning by studying a key component, namely the initial estimate in the one-dimensional search. Our approach, which is based on statistical considerations, is designed for an Armijo-style backtracking line search.
1.1 Literature Review
Progressive batching (sometimes referred to as dynamic sampling) has been well studied in the optimization literature, both for stochastic gradient and subsampled Newton-type methods (Byrd et al., 2012; Friedlander & Schmidt, 2012; Cartis & Scheinberg, 2015; Pasupathy et al., 2015; Roosta-Khorasani & Mahoney, 2016a, b; Bollapragada et al., 2016, 2017; De et al., 2017). Friedlander and Schmidt (2012) introduced theoretical conditions under which a progressive batching SG method converges linearly for finite sum problems, and experimented with a quasi-Newton adaptation of their algorithm. Byrd et al. (2012) proposed a progressive batching strategy, based on a norm test, that determines when to increase the sample size; they established linear convergence and computational complexity bounds in the case when the batch size grows geometrically. More recently, Bollapragada et al. (2017) introduced a batch control mechanism based on an inner product test that improves upon the norm test mentioned above.
There has been a renewed interest in understanding the generalization properties of small-batch and large-batch methods for training neural networks; see (Keskar et al., 2016; Dinh et al., 2017; Goyal et al., 2017; Hoffer et al., 2017). Keskar et al. (2016) empirically observed that large-batch methods converge to solutions with inferior generalization properties; however, Goyal et al. (2017) showed that large-batch methods can match the performance of small-batch methods when a warm-up strategy is used in conjunction with scaling the steplength by the same factor as the batch size. Hoffer et al. (2017) and You et al. (2017) also explored larger batch sizes and steplengths to reduce the number of updates necessary to train the network. All of these studies naturally led to an interest in progressive batching techniques. Smith et al. (2017) showed empirically that increasing the sample size and decaying the steplength are quantitatively equivalent for the SG method; hence, steplength schedules can be directly converted to batch size schedules. This approach was parallelized by Devarakonda et al. (2017). De et al. (2017) presented numerical results with a progressive batching method that employs the norm test. Balles et al. (2016) proposed an adaptive dynamic sample size scheme that couples the sample size with the steplength.
Stochastic second-order methods have been explored within the context of convex and nonconvex optimization; see (Schraudolph et al., 2007; Sohl-Dickstein et al., 2014; Mokhtari & Ribeiro, 2015; Berahas et al., 2016; Byrd et al., 2016; Keskar & Berahas, 2016; Curtis, 2016; Berahas & Takáč, 2017; Zhou et al., 2017). Schraudolph et al. (2007) ensured stability of quasi-Newton updating by computing gradients using the same batch at the beginning and end of the iteration. Since this can potentially double the cost of the iteration, Berahas et al. (2016) proposed to achieve gradient consistency by computing gradients based on the overlap between consecutive batches; this approach was further tested by Berahas and Takáč (2017). An interesting approach introduced by Martens and Grosse (2015; 2016) approximates the Fisher information matrix to scale the gradient; a distributed implementation of their K-FAC approach is described in (Ba et al., 2016). Another approach approximately computes the inverse Hessian by using the Neumann power series representation of matrices (Krishnan et al., 2017).
1.2 Contributions
This paper builds upon three algorithmic components that have recently received attention in the literature — progressive batching, stable quasi-Newton updating, and adaptive steplength selection. It advances their design and puts them together in a novel algorithm with attractive theoretical and computational properties.
The cornerstone of our progressive batching strategy is the mechanism proposed by Bollapragada et al. (2017) in the context of first-order methods. We extend their inner product control test to second-order algorithms, something that is delicate and leads to a significant modification of the original procedure. Another main contribution of the paper is the design of an Armijo-style backtracking line search where the initial steplength is chosen based on statistical information gathered during the course of the iteration. We show that this steplength procedure is effective on a wide range of applications, as it leads to well-scaled steps and allows for the BFGS update to be performed most of the time, even for nonconvex problems. We also test two techniques for ensuring the stability of quasi-Newton updating, and observe that the overlapping procedure described by Berahas et al. (2016) is more efficient than a straightforward adaptation of classical quasi-Newton methods (Schraudolph et al., 2007).
We report numerical tests on large-scale logistic regression and deep neural network training tasks that indicate that our method is robust and efficient, and has good generalization properties. An additional advantage is that the method requires almost no parameter tuning, which is possible due to the incorporation of second-order information. All of this suggests that our approach has the potential to become one of the leading optimization methods for training deep neural networks. In order to achieve this, the algorithm must be optimized for parallel execution, something that was only briefly explored in this study.
2 A Progressive Batching Quasi-Newton Method
The problem of interest is
(1)  $\min_{w \in \mathbb{R}^d} \; F(w) = \mathbb{E}\big[f(w; x, y)\big],$
where $f$ is the composition of a prediction function (parametrized by $w \in \mathbb{R}^d$) and a loss function, and $(x, y)$ are random input-output pairs with probability distribution $P(x, y)$. The associated empirical risk problem consists of minimizing
$$R(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w),$$
where we define $f_i(w) := f(w; x^i, y^i)$ for a given training set $\{(x^i, y^i)\}_{i=1}^{N}$. A stochastic quasi-Newton method is given by
(2)  $w_{k+1} = w_k - \alpha_k H_k g_k,$
where the batch (or subsampled) gradient is given by
(3)  $g_k = \dfrac{1}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k),$
the set $S_k$ indexes data points $(x^i, y^i)$ sampled from the distribution $P$, and $H_k$ is a positive definite quasi-Newton matrix. We now discuss each of the components of the new method.
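For concreteness, the following is a minimal Python sketch of one iteration of the form (2)–(3); it is an illustration, not the paper's implementation. The callables `grad_fi` (per-example gradient) and `apply_H` (application of the positive definite matrix $H_k$, e.g., via a limited-memory recursion) are assumed helpers.

```python
import numpy as np

def batch_gradient(grad_fi, w, batch):
    """Subsampled gradient (3): average of the per-example gradients over the index set."""
    return sum(grad_fi(w, i) for i in batch) / len(batch)

def sqn_step(w, grad_fi, batch, apply_H, alpha):
    """One stochastic quasi-Newton step (2): w_next = w - alpha * H @ g_batch."""
    g = batch_gradient(grad_fi, w, batch)
    return w - alpha * apply_H(g), g
```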
2.1 Sample Size Selection
The proposed algorithm has the form (2)–(3). Initially, it utilizes a small batch size $|S_0|$, which is increased gradually in order to attain a fast local rate of convergence and permit the use of second-order information. A challenging question is to determine when, and by how much, to increase the batch size over the course of the optimization procedure based on observed gradients, as opposed to using prescribed rules that depend on the iteration number $k$.
We propose to build upon the strategy introduced by Bollapragada et al. (2017) in the context of first-order methods. Their inner product test determines a sample size $|S_k|$ such that the search direction $-g_k$ is a descent direction for $F$ with high probability. A straightforward extension of this strategy to the quasi-Newton setting is not appropriate, since requiring only that a stochastic quasi-Newton search direction be a descent direction with high probability would underutilize the curvature information contained in the search direction.
We would like, instead, for the search direction $-H_k g_k$ to make an acute angle with the true quasi-Newton search direction $-H_k \nabla F(w_k)$, with high probability. Although this does not imply that $-H_k g_k$ is a descent direction for $F$, this will normally be the case for any reasonable quasi-Newton matrix.
To derive the new inner product quasi-Newton (IPQN) test, we first observe that the stochastic quasi-Newton search direction makes an acute angle with the true quasi-Newton direction in expectation, i.e.,
(4)  $\mathbb{E}_k\big[(H_k g_k)^T H_k \nabla F(w_k)\big] = \|H_k \nabla F(w_k)\|^2,$
where $\mathbb{E}_k$ denotes the conditional expectation at $w_k$. We must, however, control the variance of this quantity to achieve our stated objective. Specifically, we select the sample size $|S_k|$ such that the following condition is satisfied:
(5)  $\mathbb{E}_k\Big[\big((H_k g_k)^T H_k \nabla F(w_k) - \|H_k \nabla F(w_k)\|^2\big)^2\Big] \le \theta^2\, \|H_k \nabla F(w_k)\|^4$
for some $\theta > 0$. The left-hand side of (5) is difficult to compute, but it can be bounded by the true variance of the individual search directions, i.e.,
(6)  $\dfrac{\mathbb{E}_i\Big[\big((H_k \nabla f_i(w_k))^T H_k \nabla F(w_k) - \|H_k \nabla F(w_k)\|^2\big)^2\Big]}{|S_k|} \le \theta^2\, \|H_k \nabla F(w_k)\|^4,$
where $\mathbb{E}_i$ denotes expectation with respect to an individual sample drawn from $P$. This test involves the true expected gradient and variance, but we can approximate these quantities with sample gradient and variance estimates, respectively, yielding the practical inner product quasi-Newton test:
(7)  $\dfrac{\mathrm{Var}_{i \in S}\big((H_k \nabla f_i(w_k))^T H_k g_k\big)}{|S_k|} \le \theta^2\, \|H_k g_k\|^4,$
where $S \subseteq S_k$ is a subset of the current sample (batch), and the variance term is defined as
(8)  $\mathrm{Var}_{i \in S}\big((H_k \nabla f_i(w_k))^T H_k g_k\big) = \dfrac{1}{|S| - 1} \sum_{i \in S} \big((H_k \nabla f_i(w_k))^T H_k g_k - \|H_k g_k\|^2\big)^2.$
The variance (8) may be computed using just one additional Hessian-vector product, namely of $H_k$ with $H_k g_k$. Whenever condition (7) is not satisfied, we increase the sample size $|S_k|$. In order to estimate the increase that would lead to satisfaction of (7), we reason as follows. If we assume that the new sample $\bar S_k$ is such that the quasi-Newton direction is essentially unchanged, i.e., $H_k g_k^{\bar S_k} \approx H_k g_k$, and similarly for the variance estimate, then a simple computation shows that a lower bound on the new sample size is
(9)  $|\bar S_k| \ge \dfrac{\mathrm{Var}_{i \in S}\big((H_k \nabla f_i(w_k))^T H_k g_k\big)}{\theta^2\, \|H_k g_k\|^4}.$
In our implementation of the algorithm, we set the new sample size to the right-hand side of (9), rounded up. When the sample approximation of $\nabla F(w_k)$ is not accurate, which can occur when $|S_k|$ is small, the progressive batching mechanism just described may not be reliable. In this case we employ the moving window technique described in Section 4.2 of Bollapragada et al. (2017) to produce a sample estimate of $\nabla F(w_k)$.
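As an illustration of how this batch-size control might be organized in code, the sketch below implements a test in the spirit of (7)–(9). The function signature, the use of the whole current batch (rather than a subset) for the variance estimate, and the exact update rule for the suggested batch size are simplifying assumptions of this sketch.

```python
import numpy as np

def ipqn_batch_control(G, apply_H, theta):
    """Batch-size control in the spirit of the practical IPQN test (7)-(9).

    G       : (batch_size, d) array of per-example gradients grad f_i(w_k) over S_k
    apply_H : callable applying the positive definite quasi-Newton matrix H_k to a vector
    theta   : the variance-control parameter of (7)

    Returns (test_passed, suggested_batch_size)."""
    batch_size = G.shape[0]
    g = G.mean(axis=0)                      # subsampled gradient g_k
    Hg = apply_H(g)                         # quasi-Newton direction (up to sign)
    HHg = apply_H(Hg)                       # the one extra application of H_k noted in the text
    terms = G @ HHg                         # (H grad f_i)^T (H g) = grad f_i^T H(Hg), H symmetric
    sample_var = terms.var(ddof=1)          # sample variance of the individual inner products
    rhs = theta ** 2 * np.dot(Hg, Hg) ** 2  # theta^2 * ||H g||^4
    if sample_var / batch_size <= rhs:      # condition (7) holds: keep the current batch size
        return True, batch_size
    # Otherwise, use the lower bound (9) as the new batch size (rounded up).
    return False, max(batch_size + 1, int(np.ceil(sample_var / rhs)))
```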
2.2 The Line Search
In deterministic optimization, line searches are employed to ensure that the step is not too short and to guarantee sufficient decrease in the objective function. Line searches are particularly important in quasi-Newton methods since they ensure robustness and efficiency of the iteration with little additional cost.
In contrast, stochastic line searches are poorly understood and rarely employed in practice because they must make decisions based on sample function values
(10)  $F_{S_k}(w) = \dfrac{1}{|S_k|} \sum_{i \in S_k} f_i(w),$
which are noisy approximations to the true objective $F$. One of the key questions in the design of a stochastic line search is how to ensure, with high probability, that there is a decrease in the true function when one can only observe the stochastic approximations $F_{S_k}$. We address this question by proposing a formula for the steplength that controls possible increases in the true function. Specifically, the first trial steplength in the stochastic backtracking line search is computed so that the predicted decrease in the expected function value is sufficiently large, as we now explain.
Using Lipschitz continuity of $\nabla F$ and taking the conditional expectation, we can show the following inequality
(11)
where $L$ is the Lipschitz constant of $\nabla F$. The proof of (11) is given in the supplement (Lemma A.1).
The only difference in (11) between the deterministic and stochastic quasi-Newton methods is an additional variance term in the matrix that multiplies the gradient. To obtain a decrease in the function value in the deterministic case, this matrix must be positive definite, whereas in the stochastic case it must be positive definite to yield a decrease in $F$ in expectation. In the deterministic case, for a reasonably good quasi-Newton matrix $H_k$, one expects that $\alpha_k = 1$ will result in a decrease in the function, and therefore the initial trial steplength should be chosen to be 1. In the stochastic case, the initial trial value
(12)
will result in a decrease in the expected function value. However, since formula (12) involves the expensive computation of the individual matrix-vector products $H_k \nabla f_i(w_k)$, we approximate the variance-bias ratio as follows:
(13) 
In our practical implementation, we estimate the population variance and gradient with the sample variance and gradient, respectively, yielding the initial steplength
(14) 
where
(15)
is the sample variance computed over the current batch. With this initial value of $\alpha_k$ in hand, our algorithm performs a backtracking line search that aims to satisfy the Armijo condition
(16)  $F_{S_k}(w_k - \alpha_k H_k g_k) \le F_{S_k}(w_k) - c_1 \alpha_k\, g_k^T H_k g_k,$
where $c_1 > 0$ is a sufficient decrease parameter.
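The sketch below illustrates the overall structure of this line search: an initial trial steplength of the schematic form $1/(1 + \text{variance estimate}/(|S_k|\,\|H_k g_k\|^2))$, followed by Armijo backtracking on the sampled objective as in (16). The per-example gradient variance used here is a simple stand-in for the estimate in (15), and the constants `c1` and `tau` are conventional defaults; none of these are the paper's exact choices.

```python
import numpy as np

def initial_steplength(G, Hg):
    """Schematic initial trial value in the spirit of (12)-(15):
    alpha_0 = 1 / (1 + variance estimate / (|S_k| * ||H g||^2)).
    The per-example gradient variance below is an illustrative stand-in for (15)."""
    batch_size = G.shape[0]
    g = G.mean(axis=0)
    variance = np.sum((G - g) ** 2) / (batch_size - 1)
    return 1.0 / (1.0 + variance / (batch_size * np.dot(Hg, Hg)))

def armijo_backtracking(F_batch, w, g, Hg, alpha0, c1=1e-4, tau=0.5, max_backtracks=20):
    """Backtracking line search on the sampled objective, aiming for condition (16).
    c1 and tau are conventional defaults, not necessarily the paper's settings."""
    f0 = F_batch(w)
    directional = np.dot(g, Hg)          # g^T H g > 0 for a positive definite H
    alpha = alpha0
    for _ in range(max_backtracks):
        if F_batch(w - alpha * Hg) <= f0 - c1 * alpha * directional:
            break
        alpha *= tau
    return alpha
```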
2.3 Stable Quasi-Newton Updates
In the BFGS and L-BFGS methods, the inverse Hessian approximation is updated using the formula
(17)  $H_{k+1} = \big(I - \rho_k s_k y_k^T\big)\, H_k\, \big(I - \rho_k y_k s_k^T\big) + \rho_k s_k s_k^T, \qquad \rho_k = \dfrac{1}{y_k^T s_k},$
where $s_k = w_{k+1} - w_k$ and $y_k$ is the difference in the gradients at $w_{k+1}$ and $w_k$. When the batch changes from one iteration to the next ($S_k \ne S_{k+1}$), it is not obvious how $y_k$ should be defined. It has been observed that when $y_k$ is computed using different samples, the updating process may be unstable, and hence it seems natural to use the same sample at the beginning and at the end of the iteration (Schraudolph et al., 2007), and define
(18)  $y_k = \nabla F_{S_k}(w_{k+1}) - \nabla F_{S_k}(w_k).$
However, this requires that the gradient over every batch be evaluated twice, at $w_k$ and at $w_{k+1}$. To avoid this additional cost, Berahas et al. (2016) propose to use the overlap between consecutive samples in the gradient differencing. If we denote this overlap as $O_k = S_k \cap S_{k+1}$, then one defines
(19)  $y_k = \nabla F_{O_k}(w_{k+1}) - \nabla F_{O_k}(w_k).$
This requires no extra computation since the two gradients in this expression are subsets of the gradients corresponding to the samples $S_k$ and $S_{k+1}$. The overlap should not be too small, to avoid differencing noise, but this is easily achieved in practice. We test both formulas for $y_k$ in our implementation of the method; see Section 4.
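The two options for the curvature vector are contrasted in the sketch below. Here `batch_grad(w, indices)` is an assumed helper returning the average gradient over the given indices; in an actual implementation the overlap gradients in the multi-batch variant would be cached rather than recomputed, which is what makes (19) essentially free.

```python
def curvature_pair_full_overlap(batch_grad, S_k, w_k, w_next):
    """Full-overlap (FO) choice (18): difference gradients over the same sample S_k,
    which requires an extra gradient evaluation at w_next on that sample."""
    s = w_next - w_k
    y = batch_grad(w_next, S_k) - batch_grad(w_k, S_k)
    return s, y

def curvature_pair_multi_batch(batch_grad, S_k, S_next, w_k, w_next):
    """Multi-batch (MB) choice (19): difference gradients over the overlap of S_k and S_{k+1}.
    These two gradients are sub-sums of gradient work already performed for the full batches."""
    overlap = sorted(set(S_k) & set(S_next))
    s = w_next - w_k
    y = batch_grad(w_next, overlap) - batch_grad(w_k, overlap)
    return s, y
```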
2.4 The Complete Algorithm
The pseudocode of the progressive batching L-BFGS method is given in Algorithm 1. Observe that the limited memory Hessian approximation $H_k$ in Line 8 is independent of the choice of the sample $S_k$. Specifically, $H_k$ is defined by a collection of curvature pairs $\{(s_j, y_j)\}$, where the most recent pair is based on the previous sample; see Line 14. For the batch size control test (7), we choose a fixed value of $\theta$ in the logistic regression experiments, while $\theta$ is a tunable parameter in the neural network experiments. The constant $c_1$ in (16) is set to a small fixed value, and the L-BFGS memory is set to a fixed number of pairs. We skip the quasi-Newton update if the following curvature condition is not satisfied:
(20)  $y_k^T s_k > \epsilon\, \|s_k\|^2,$ for a small constant $\epsilon > 0$.
The initial Hessian matrix in the L-BFGS recursion at each iteration is chosen as $\gamma_k I$, where $\gamma_k = \dfrac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$.
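To show how the limited-memory matrix $H_k$ is applied and where the scaling $\gamma_k$ and a skip rule of the form (20) enter, here is a standard L-BFGS two-loop recursion in Python. The memory size and skip threshold below are illustrative defaults, not the paper's settings, and the sketch stands in for Algorithm 1 only schematically.

```python
import numpy as np
from collections import deque

def lbfgs_two_loop(g, pairs):
    """Apply the limited-memory matrix H_k to g via the standard two-loop recursion,
    with initial matrix gamma_k * I, gamma_k = s^T y / y^T y for the most recent pair."""
    q = np.array(g, dtype=float)
    alphas = []
    for s, y in reversed(pairs):                     # newest pair first
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    if pairs:
        s, y = pairs[-1]
        gamma = np.dot(s, y) / np.dot(y, y)          # scaling of the initial matrix
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a in zip(pairs, reversed(alphas)):   # oldest pair first
        rho = 1.0 / np.dot(y, s)
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r                                         # r = H_k @ g

def maybe_store_pair(pairs, s, y, memory=10, eps=1e-8):
    """Store the pair only if a curvature condition of the form (20) holds; skip otherwise.
    The memory size and threshold are illustrative, not the paper's values."""
    if np.dot(y, s) > eps * np.dot(s, s):
        pairs.append((s, y))
        if len(pairs) > memory:
            pairs.popleft()                          # 'pairs' is assumed to be a collections.deque
    return pairs
```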
3 Convergence Analysis
We now present convergence results for the proposed algorithm, both for strongly convex and nonconvex objective functions. Our emphasis is on analyzing the effect of progressive sampling; therefore, we follow common practice and assume that the steplength in the algorithm is fixed ($\alpha_k = \alpha$), and that the inverse L-BFGS matrix $H_k$ has bounded eigenvalues, i.e.,
(21)  $\Lambda_1 I \preceq H_k \preceq \Lambda_2 I, \qquad 0 < \Lambda_1 \le \Lambda_2.$
This assumption can be justified both in the convex and nonconvex cases under certain conditions; see (Berahas et al., 2016). We assume that the sample size is controlled by the exact inner product quasi-Newton test (31). This test is designed for efficiency, and in rare situations could allow for the generation of arbitrarily long search directions. To prevent this from happening, we introduce an additional control on the sample size $|S_k|$ by extending (to the quasi-Newton setting) the orthogonality test introduced in (Bollapragada et al., 2017). This additional requirement states that the current sample size is acceptable only if
(22)
for some given $\nu > 0$.
We now establish linear convergence when the objective is strongly convex.
Theorem 3.1.
Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that
(23)  $\mu I \preceq \nabla^2 F(w) \preceq L I, \quad \text{for all } w \in \mathbb{R}^d.$
Let $\{w_k\}$ be generated by iteration (29), for any $w_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31). Suppose that the orthogonality condition (32) holds at every iteration, and that the matrices $H_k$ satisfy (B.2). Then, if the steplength $\alpha$ satisfies
(24)
we have that
(25)
where $w^*$ denotes the minimizer of $F$, and $\mathbb{E}$ denotes the total expectation.
The proof of this result is given in the supplement. We now consider the case when $F$ is nonconvex and bounded below.
Theorem 3.2.
Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that
(26)  $\nabla^2 F(w) \preceq L I, \quad \text{for all } w \in \mathbb{R}^d.$
Let $\{w_k\}$ be generated by iteration (29), for any $w_0$, where $|S_k|$ is chosen so that (31) and (32) are satisfied, and suppose that (B.2) holds. Then, if the steplength $\alpha$ satisfies (36), we have
(27)  $\lim_{k \to \infty} \mathbb{E}\big[\|\nabla F(w_k)\|^2\big] = 0.$
Moreover, for any positive integer $T$, the smallest expected squared gradient norm among the first $T$ iterates is bounded by a quantity of order $(F(w_0) - F_{\mathrm{low}})/T$,
where $F_{\mathrm{low}}$ is a lower bound on $F$ in $\mathbb{R}^d$.
The proof is given in the supplement. This result shows that the sequence of gradients converges to zero in expectation, and establishes a global sublinear rate of convergence of the smallest gradients generated after every $T$ steps.
4 Numerical Results
In this section, we present numerical results for the proposed algorithm, which we refer to as PBQN, for the Progressive Batching Quasi-Newton algorithm.
4.1 Experiments on Logistic Regression Problems
We first test our algorithm on binary classification problems where the objective function is given by the logistic loss with regularization:
(28)  $F(w) = \dfrac{1}{N} \sum_{i=1}^{N} \log\big(1 + \exp(-z^i\, w^T x^i)\big) + \dfrac{\lambda}{2} \|w\|^2,$
with labels $z^i \in \{-1, +1\}$ and a fixed regularization parameter $\lambda > 0$. We consider the datasets listed in the supplement. An approximation $F^*$ of the optimal function value is computed for each problem by running the full batch L-BFGS method to high accuracy. Training error is defined as $F(w_k) - F^*$, where $F$ is evaluated over the training set; test loss is evaluated over the test set without the regularization term.
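For concreteness, a minimal implementation of the objective and gradient, matching the form of (28) as written above, is given below; the helper name and the treatment of the regularization weight as a free parameter are assumptions of this sketch.

```python
import numpy as np

def logistic_loss_and_grad(w, X, z, lam):
    """Regularized logistic loss and gradient in the form of (28).

    X : (n, d) feature matrix; z : (n,) labels in {-1, +1}; lam : L2 regularization weight."""
    margins = z * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(w, w)
    sigma = np.exp(-np.logaddexp(0.0, margins))   # sigmoid(-margin), computed stably
    grad = -(X.T @ (z * sigma)) / X.shape[0] + lam * w
    return loss, grad
```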
We tested two options for computing the curvature vector $y_k$ in the PBQN method: the multi-batch (MB) approach (19) with 25% sample overlap, and the full overlap (FO) approach (18). We fixed the value of $\theta$ in (7) and set all other parameters to the default values given in Section 2. Thus, none of the parameters in our PBQN method was tuned for each individual dataset. We compared our algorithm against two other methods: (i) stochastic gradient (SG) with a batch size of 1; (ii) SVRG (Johnson & Zhang, 2013) with the inner loop length set to a fixed value. The steplength for SG and SVRG is constant and tuned for each problem so as to give the best performance.
In Figures 9 and 2 we present results for two datasets, spam and covertype; the rest of the results are given in the supplement. The horizontal axis measures the number of full gradient evaluations, or equivalently, the number of times that $N$ component gradients $\nabla f_i$ are evaluated. The leftmost figure reports the long-term trend over 100 gradient evaluations, while the remaining figures zoom into the first 10 gradient evaluations to show the initial behavior of the methods. The vertical axis measures training error, test loss, and test accuracy, respectively, from left to right.
The proposed algorithm competes well for these two datasets in terms of training error, test loss, and test accuracy, and decreases these measures more evenly than SG and SVRG. Our numerical experience indicates that formula (14) is quite effective at estimating the steplength parameter, as it is accepted by the backtracking line search for most iterations. As a result, the line search computes very few additional function values.
It is interesting to note that SVRG is not as efficient as PBQN or SG in the initial epochs, when measured in terms of either test loss or test accuracy. The training error for SVRG decreases rapidly in later epochs, but this rapid improvement is not observed in the test loss and accuracy. Neither PBQN nor SVRG significantly outperforms the other in terms of training error across all datasets tested, as observed in the supplement.
Our results indicate that defining the curvature vector $y_k$ using the MB approach is preferable to using the FO approach. The number of iterations required by the PBQN method is significantly smaller than for the SG method, suggesting the potential efficiency gains of a parallel implementation of our algorithm.
4.2 Results on Neural Networks
We have performed a preliminary investigation into the performance of the PBQN algorithm for training neural networks. As is well known, the resulting optimization problems are quite difficult due to the existence of local minimizers, some of which generalize poorly. Thus, our first requirement when applying the PBQN method was to obtain generalization as good as that of SG, something we have achieved.
Our investigation into how to obtain fast performance is, however, still underway for reasons discussed below. Nevertheless, our results are worth reporting because they show that our line search procedure is performing as expected, and that the overall number of iterations required by the PBQN method is small enough that a parallel implementation could yield state-of-the-art results, based on the theoretical performance model detailed in the supplement.
We compared our algorithm, as described in Section 2, against SG and Adam (Kingma & Ba, 2014). It has taken many years to design regularization techniques and heuristics that greatly improve the performance of the SG method for deep learning (Srivastava et al., 2014; Ioffe & Szegedy, 2015). These include batch normalization and dropout, which (in their current form) are not conducive to the PBQN approach due to the need for gradient consistency when evaluating the curvature pairs in L-BFGS. Therefore, we do not implement batch normalization and dropout in any of the methods tested, and leave the study of their extension to the PBQN setting as future work.
We consider three network architectures: (i) a small convolutional neural network on CIFAR-10 (Krizhevsky, 2009), (ii) an AlexNet-like convolutional network on MNIST and CIFAR-10 (LeCun et al., 1998; Krizhevsky et al., 2012), and (iii) a residual network (ResNet-18) on CIFAR-10 (He et al., 2016). The network architecture details and additional plots are given in the supplement. All of these networks were implemented in PyTorch (Paszke et al., 2017). The results for the CIFAR-10 AlexNet and CIFAR-10 ResNet-18 are given in Figures 15 and 16, respectively. We report results both against the total number of iterations and the total number of gradient evaluations. Table 1 shows the best test accuracies attained by each of the four methods over the various networks.
In all our experiments, we initialize the batch size to a small value in the PBQN method, and fix the batch size for SG and Adam. The parameter $\theta$ in (7), which controls the batch size increase in the PBQN method, was tuned lightly by choosing among the three values 0.9, 2, and 3. SG and Adam are tuned using a development-based decay (dev-decay) scheme, which tracks the best validation loss at each epoch and reduces the steplength by a constant factor if the validation loss does not improve after a fixed number of epochs.
Table 1. Best test accuracy (%) attained by each method on the four networks.

Network                     SG      Adam    MB      FO
CIFAR-10 (small ConvNet)    66.24   67.03   67.37   62.46
MNIST (AlexNet)             99.25   99.34   99.16   99.05
CIFAR-10 (AlexNet)          73.46   73.59   73.02   72.74
CIFAR-10 (ResNet-18)        69.5    70.16   70.28   69.44
We observe from our results that the PBQN method achieves a similar test accuracy to SG and Adam, but requires more gradient evaluations. Improvements in performance can be obtained by ensuring that the PBQN method exerts a finer control on the sample size in the small-batch regime — something that requires further investigation. Nevertheless, the small number of iterations required by the PBQN method, together with the fact that it employs larger batch sizes than SG during much of the run, suggests that a distributed version, similar to a data-parallel distributed implementation of the SG method (Chen et al., 2016; Das et al., 2016), would lead to a highly competitive method.
Similar to the logistic regression case, we observe that the steplength computed via (14) is almost always accepted by the Armijo condition. Once the algorithm has trained for a significant number of iterations using the full batch, it begins to overfit the training set, resulting in worsened test loss and accuracy, as observed in the graphs.
5 Final Remarks
Several types of quasi-Newton methods have been proposed in the literature to address the challenges arising in machine learning. Some of these methods operate in the purely stochastic setting (which makes quasi-Newton updating difficult) or in the purely batch regime (which leads to generalization problems). We believe that progressive batching is the right context for designing an L-BFGS method that has good generalization properties, does not expose any free parameters, and converges quickly. The advantages of our approach are clearly seen in the logistic regression experiments. To make the new method competitive with SG and Adam for deep learning, we need to improve several of its components. This includes the design of a more robust progressive batching mechanism, the redesign of batch normalization and dropout heuristics to improve the generalization performance of our method when training larger networks, and, most importantly, the design of a parallelized implementation that takes advantage of the higher granularity of each iteration. We believe that the potential of the proposed approach as an alternative to SG for deep learning is worthy of further investigation.
Acknowledgements
We thank Albert Berahas for his insightful comments regarding multi-batch L-BFGS and probabilistic line searches, as well as for his useful feedback on earlier versions of the manuscript. We also thank the anonymous reviewers for their useful feedback. Bollapragada is supported by DOE award DE-FG02-87ER25047. Nocedal is supported by NSF award DMS-1620070. Shi is supported by Intel grant SP0036122.
References
 Ba et al. (2016) Ba, J., Grosse, R., and Martens, J. Distributed second-order optimization using Kronecker-factored approximations. 2016.
 Balles et al. (2016) Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.
 Berahas & Takáč (2017) Berahas, A. S. and Takáč, M. A robust multi-batch L-BFGS method for machine learning. arXiv preprint arXiv:1707.08552, 2017.
 Berahas et al. (2016) Berahas, A. S., Nocedal, J., and Takáč, M. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.
 Bertsekas et al. (2003) Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. Convex analysis and optimization. Athena Scientific Belmont, 2003.
 Bollapragada et al. (2016) Bollapragada, R., Byrd, R., and Nocedal, J. Exact and inexact subsampled Newton methods for optimization. arXiv preprint arXiv:1609.08502, 2016.
 Bollapragada et al. (2017) Bollapragada, R., Byrd, R., and Nocedal, J. Adaptive sampling strategies for stochastic optimization. arXiv preprint arXiv:1710.11258, 2017.
 Byrd et al. (2012) Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.
 Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasiNewton method for largescale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
 Carbonetto (2009) Carbonetto, P. New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods. PhD thesis, University of British Columbia, 2009.
 Cartis & Scheinberg (2015) Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pp. 1–39, 2015.

 Chang & Lin (2011) Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 Chen et al. (2016) Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
 Cormack & Lynam (2005) Cormack, G. and Lynam, T. Spam corpus creation for TREC. In Proc. 2nd Conference on Email and AntiSpam, 2005. http://plg.uwaterloo.ca/˜gvcormac/treccorpus.
 Curtis (2016) Curtis, F. A selfcorrecting variablemetric algorithm for stochastic optimization. In International Conference on Machine Learning, pp. 632–641, 2016.
 Das et al. (2016) Das, D., Avancha, S., Mudigere, D., Vaidynathan, K., Sridharan, S., Kalamkar, D., Kaul, B., and Dubey, P. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
 De et al. (2017) De, S., Yadav, A., Jacobs, D., and Goldstein, T. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.
 Devarakonda et al. (2017) Devarakonda, A., Naumov, M., and Garland, M. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029, 2017.
 Dinh et al. (2017) Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 Friedlander & Schmidt (2012) Friedlander, M. P. and Schmidt, M. Hybrid deterministicstochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Grosse & Martens (2016) Grosse, R. and Martens, J. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pp. 573–582, 2016.
 Guyon et al. (2008) Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet, J., Spirtes, P., and Statnikov, A. R. Design and analysis of the causation and prediction challenge. In WCCI Causation and Prediction Challenge, pp. 1–33, 2008.
 Hardt et al. (2015) Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456, 2015.

 Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pp. 315–323, 2013.
 Keskar & Berahas (2016) Keskar, N. S. and Berahas, A. S. adaQN: An adaptive quasi-Newton algorithm for training RNNs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 1–16. Springer, 2016.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krishnan et al. (2017) Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann optimizer: A practical optimization algorithm for deep neural networks. arXiv preprint arXiv:1712.03298, 2017.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 Kurth et al. (2017) Kurth, T., Zhang, J., Satish, N., Racah, E., Mitliagkas, I., Patwary, M. M. A., Malas, T., Sundaram, N., Bhimji, W., Smorkalov, M., et al. Deep learning at 15pf: Supervised and semisupervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 7. ACM, 2017.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liu & Nocedal (1989) Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1–3):503–528, 1989.
 Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with kroneckerfactored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
 Mokhtari & Ribeiro (2015) Mokhtari, A. and Ribeiro, A. Global convergence of online limited memory BFGS. Journal of Machine Learning Research, 16(1):3151–3181, 2015.
 Nocedal & Wright (1999) Nocedal, J. and Wright, S. Numerical Optimization. Springer New York, 2 edition, 1999.
 Pasupathy et al. (2015) Pasupathy, R., Glynn, P., Ghosh, S., and Hashemi, F. S. On sampling rates in stochastic recursions. 2015. Under Review.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
 RoostaKhorasani & Mahoney (2016a) RoostaKhorasani, F. and Mahoney, M. W. Subsampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016a.
 RoostaKhorasani & Mahoney (2016b) RoostaKhorasani, F. and Mahoney, M. W. Subsampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016b.
 Schraudolph et al. (2007) Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 436–443, 2007.
 Smith et al. (2017) Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
 SohlDickstein et al. (2014) SohlDickstein, J., Poole, B., and Ganguli, S. Fast largescale optimization by unifying stochastic gradient and quasiNewton methods. In International Conference on Machine Learning, pp. 604–612, 2014.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 You et al. (2017) You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
 Zhou et al. (2017) Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive quasiNewton methods for minimizing expected values. In International Conference on Machine Learning, pp. 4150–4159, 2017.
Appendix A Initial Step Length Derivation
To establish our results, recall that the stochastic quasi-Newton method is defined as
(29)  $w_{k+1} = w_k - \alpha_k H_k g_k,$
where the batch (or subsampled) gradient is given by
(30)  $g_k = \dfrac{1}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k),$
and the set $S_k$ indexes the data points sampled at iteration $k$. The algorithm selects the Hessian approximation $H_k$ through quasi-Newton updating prior to selecting the new sample $S_k$ used to define the search direction $-H_k g_k$. We will use $\mathbb{E}_k$ to denote the conditional expectation at $w_k$, and $\mathbb{E}$ to denote the total expectation.
The primary theoretical mechanism for determining batch sizes is the exact variance inner product quasi-Newton (IPQN) test, which is defined as
(31)  $\dfrac{\mathbb{E}_i\Big[\big((H_k \nabla f_i(w_k))^T H_k \nabla F(w_k) - \|H_k \nabla F(w_k)\|^2\big)^2\Big]}{|S_k|} \le \theta^2\, \|H_k \nabla F(w_k)\|^4.$
We establish the inequality used to determine the initial steplength for the stochastic line search.
Lemma A.1.
Assume that $F$ is continuously differentiable with Lipschitz continuous gradient, with Lipschitz constant $L$. Then inequality (11) holds for the iterates generated by (29).
Proof.
By Lipschitz continuity of the gradient, and using $w_{k+1} = w_k - \alpha_k H_k g_k$, we obtain a quadratic upper bound on $F(w_{k+1})$; taking the conditional expectation $\mathbb{E}_k$ and splitting the resulting second-order term into its mean and variance components gives the stated inequality.
∎
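For orientation, here is a sketch of the standard argument behind this lemma under the stated Lipschitz assumption; it reproduces the structure of (11), though not necessarily the exact constants of Lemma A.1.

```latex
% Sketch of the standard argument (structure only; constants may differ from Lemma A.1).
% Lipschitz continuity of \nabla F and the update w_{k+1} = w_k - \alpha_k H_k g_k give
\begin{align*}
  F(w_{k+1}) \le F(w_k) - \alpha_k \nabla F(w_k)^T H_k g_k
              + \frac{L \alpha_k^2}{2}\, \|H_k g_k\|^2 .
\end{align*}
% Taking the conditional expectation, using \mathbb{E}_k[g_k] = \nabla F(w_k) and
% \mathbb{E}_k\|H_k g_k\|^2 = \|H_k \nabla F(w_k)\|^2 + \mathbb{E}_k\|H_k(g_k - \nabla F(w_k))\|^2,
\begin{align*}
  \mathbb{E}_k[F(w_{k+1})] \le F(w_k)
   - \alpha_k \nabla F(w_k)^T H_k \nabla F(w_k)
   + \frac{L \alpha_k^2}{2}\Big( \|H_k \nabla F(w_k)\|^2
   + \mathbb{E}_k\|H_k(g_k - \nabla F(w_k))\|^2 \Big).
\end{align*}
% The variance term in the last line is the one discussed after (11) in Section 2.2, and
% requiring the right-hand side to fall below F(w_k) motivates a variance-dependent
% initial steplength of the form (12).
```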
Appendix B Convergence Analysis
For the rest of our analysis, we make the following two assumptions.
Assumptions B.1.
The orthogonality condition is satisfied for all $k$, i.e.,
(32)
holds for some large $\nu > 0$.
Assumptions B.2.
The eigenvalues of $H_k$ are contained in an interval in $\mathbb{R}_{>0}$, i.e., for all $k$ there exist constants $0 < \Lambda_1 \le \Lambda_2$ such that
(33)  $\Lambda_1 I \preceq H_k \preceq \Lambda_2 I.$
Condition (32) ensures that the stochastic quasi-Newton direction is bounded away from orthogonality to $-H_k \nabla F(w_k)$ with high probability, and prevents the variance in the individual quasi-Newton directions from being too large relative to their variance along $H_k \nabla F(w_k)$. Assumption B.2 holds, for example, when $F$ is convex and a regularization parameter is included so that any subsampled Hessian is positive definite. It can also be shown to hold in the nonconvex case by applying cautious BFGS updating, e.g., by updating only when $y_k^T s_k \ge \epsilon \|s_k\|^2$, where $\epsilon > 0$ is a predetermined constant (Berahas et al., 2016).
We begin by establishing a technical descent lemma.
Lemma B.3.
Suppose that $F$ is twice continuously differentiable and that there exists a constant $L > 0$ such that
(34)  $\nabla^2 F(w) \preceq L I, \quad \text{for all } w \in \mathbb{R}^d.$
Let $\{w_k\}$ be generated by iteration (29) for any $w_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31) for a given constant $\theta > 0$, and suppose that Assumptions B.1 and B.2 hold. Then, for any $k$,
(35) 
Moreover, if the steplength $\alpha_k$ satisfies
(36) 
we have that
(37) 
Proof.
By Assumption B.1 (the orthogonality condition), we have that
(38)  
Now, expanding the left-hand side of inequality (38), we get
Therefore, rearranging gives the inequality
(39) 
To bound the first term on the right-hand side of this inequality, we use the inner product quasi-Newton test; in particular, $|S_k|$ satisfies
(40) 
where the second inequality holds by the IPQN test. Since
(41) 
we have
(42) 
by (40) and (41). Substituting (42) into (39), we get the following bound on the length of the search direction:
which proves (35). Using this inequality, Assumption B.2, and bounds on the Hessian and steplength (34) and (36), we have