1 Introduction
The problem of interest is minimizing functions of the following form:
(1) 
common to problems in statistics and machine learning. For instance, in the empirical risk minimization framework, a model is learned from a set
of training data by minimizing an empirical loss function of the form
(2) 
where we define , and is the composition of a prediction function (parametrized by ) and a loss function, and
are random input-output pairs with the uniform discrete probability distribution
. An objective function of the form (1) is often impractical, as the distribution of is generally unavailable, making it infeasible to analytically compute. This can be resolved by replacing the expectation by the estimate (
2). The strong law of large numbers implies that the sample mean in (
2) converges almost surely to (1) as the number of samples increases. However, in practice, even problem (2) is not tractable for classical optimization algorithms, as the amount of data is usually extremely large. A better strategy when optimizing (2) is to consider subsamples of the data to reduce the computational cost. This leads to stochastic algorithms in which the objective function changes at each iteration through the random selection of subsamples.
2 Related work
Stochastic gradient descent (SGD) (vanillasgd), using either single samples or minibatches, is widely used; however, it presents several challenges in terms of convergence speed. Hence, the topic of learning rate tuning has been extensively investigated. Modified versions of SGD based on the use of momentum terms (Polyak64); (NESTEROV83)
have been shown to speed up the convergence of SGD on smooth convex functions. In DNNs, other popular variants of SGD scale the individual components of the stochastic gradient using past gradient observations in order to deal with variations in the magnitudes of the stochastic gradient components (especially between layers). Among these are ADAGRAD
(Duchi2011), RMSPROP
(Tieleman2012), ADADELTA (zeiler2012), and ADAM (Kingma2014AdamAM), as well as SHAMPOO (Gupta2018ShampooPS), a structured matrix-scaling version of ADAGRAD. These methods are scale-invariant but do not avoid the need for prior tuning of the base learning rate.
A large family of algorithms exploits a second-order approximation of the objective function to better capture its local curvature and avoid the manual choice of a learning rate. Stochastic L-BFGS methods have recently been studied in (zhougaodon) and (bollapragadaBFGS)
. Although fast in theory (in terms of the number of iterations), they are more expensive in terms of iteration cost and memory requirements; the considerable overhead they introduce usually makes these approaches less attractive than first-order methods for modern deep learning architectures. Other Hessian-approximation-based methods have also been proposed in
(hessianfree), (MartensG15), (tonga), (naturalg), based on the Gauss-Newton and natural gradient methods. However, pure SGD is still of interest, as it often has better generalization properties (sgdgen).
3 Our contributions
 A new stochastic optimization method is proposed that uses an adaptive step size depending on the local curvature of the objective function, eliminating the need for the user to perform prior learning rate tuning or a line search.
 The adaptive step size is computed efficiently using a Hessian-vector product (without computing the full Hessian
(Pearlmutter)) that has the same cost as a gradient computation. Hence, the resulting method can be applied to train DNNs using backpropagation.
 A dynamic batch sampling strategy is proposed that increases the batch size progressively, using ideas from (Bollapragada2017AdaptiveSS) but based upon a new "acute-angle" test. In contrast, the approach in (zhougaodon) uses empirical process concentration inequalities, resulting in worst-case lower bounds on batch sizes that increase rapidly to the full batch.
 Our adaptive framework can also be combined with other variants of SGD (ADAM, ADADELTA, ADAGRAD, etc.). We present an adaptive version of ADAM and show numerically that the resulting method is able to select excellent base learning rates.
 An interesting empirical observation in our DNN experiments was that, using our Adaptive framework on a convolutional neural network and a ResNet, we never encountered negative local curvature along the step direction.
4 A Dynamic Sampling AdaptiveSGD Method
4.1 Definitions and notation
We propose an iterative method of the following form. At the kth iteration, we draw i.i.d. samples and define the empirical objective function
An approximation to the gradient of can be obtained by sampling. Let denote the set of samples used to estimate the Hessian, and let
An approximation to the Hessian of can also be obtained by sampling. At the iterate , we define
A firstorder method based on these approximations is then given by
(3) 
where, in our approach, is the adaptive step size, and the sets of samples and that are used to estimate the gradient and Hessian adaptively grow as needed, and are drawn independently. In the following sections, we will describe how we choose the step size and the batch sizes and . We use the following notation:

.

.

.

.

; .
Definition 1.
A convex function $f$ is self-concordant if there exists a constant $M_f \ge 0$ such that for every $x$ in its domain and every direction $h$, we have:
$$\bigl|\nabla^3 f(x)[h,h,h]\bigr| \;\le\; 2 M_f \bigl(\nabla^2 f(x)[h,h]\bigr)^{3/2}.$$
$f$ is standard self-concordant if the above is satisfied for $M_f = 1$.
Selfconcordant functions were introduced by Nesterov and Nemirovski in the context of interiorpoint methods (Nesterov1994InteriorpointPM). Many problems in machine learning have selfconcordant formulations: In (zhangb15disco) and (Bach2009SelfconcordantAF), it is shown that regularized regression, with either logistic loss or hinge loss, is selfconcordant.
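As a concrete illustration of Definition 1, the logarithmic barrier $f(x) = -\log x$ on $x > 0$ is standard self-concordant and attains the bound with equality:

```latex
f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}, \qquad
\bigl|f'''(x)\bigr| = \frac{2}{x^3}
  = 2\left(\frac{1}{x^2}\right)^{3/2}
  = 2\,f''(x)^{3/2},
```

which is the inequality of Definition 1 with $M_f = 1$; this is the canonical example from the interior-point literature.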
4.2 Assumptions
We first give the intuition behind the sampling strategies and step size used in our method. The formal proof of the convergence of our algorithm will be given in the Appendix. Throughout this section, our theoretical analysis is based on the following technical assumptions on :

Hessian regularity: there exist constants such that:
The condition number, denoted , is given by .

Uniform Sampled Hessian regularity: For the above defined , for every sample set and , we have :

Gradient regularity: We assume that the norm of the gradient of stays bounded above during the optimization procedure:

Function regularity: is standard selfconcordant.
4.3 Analysis
Consider the iterative method for minimizing self-concordant functions described in (3) using the step size , where and .
Methods of this type have been analyzed in (TranDinh2015CompositeCM) and (gaogoldself) in the deterministic setting. In the latter paper, the above choice of is shown to guarantee a decrease in the function value.
Lemma 1.
(Lemma 4.1, gaogoldself)
For standard selfconcordant, for all :
Letting , the above statement is equivalent to:
(4) 
If , then:
where .
Since we do not want to compute the full Hessian at , we cannot compute or (the true argmax of ). Instead, we use estimates of these quantities based on a subsampled Hessian:


.

.
Consequently, the update step in practice will be:
We will initially study the behaviour of the estimated values , using the framework described in (RoostaKhorasani2019), which shows that, for gradient and Hessian subsampling, using random matrix concentration inequalities, one can subsample in a way that first- and second-order information, i.e., curvature, is well estimated with probability at least
, where can be chosen arbitrarily.
Lemma 2.
(Lemma 2 in (RoostaKhorasani2019)) Under Assumptions 1 and 2, given any , if , we have:
(5) 
where is the operator norm induced by the Euclidean norm.
Lemma 3.
(Lemma 3 in (RoostaKhorasani2019)) If for any and any and , then
where
If we sample the Hessian at the rate described in Lemma 2, the following inequalities are satisfied with probability at least :
This allows us to estimate the curvature along an arbitrary direction by subsampling the local norm squared, , with precision . Combined with Lemma 1 and the use of step size , this guarantees a decrease in with probability at least :
Using the gradient sampling rate in Lemma 3, we can control the stochasticity of the rightmost term above to obtain:
Theorem 1.
Let . Suppose that satisfies Assumptions 1–4 and that we have an efficient way to compute , where . Let be the set of iterates and sampled Hessian and gradient batches generated by taking the step at iteration , starting from any , where are chosen such that Lemma 2 and Lemma 3 are satisfied at each iteration. Then, for any , with probability we have:
where
As pointed out in (RoostaKhorasani2019), for many generalized linear models (which are also self-concordant), the upper bounds and are known. However, computing these values in other settings (e.g., DNNs) can be challenging. Moreover, the above method often leads to a large batch size, since we require very strong guarantees on the quality of the estimates. Specifically, we require the estimated curvature along any direction (not just the descent direction) to be close to the true curvature, and the sampled gradient to be close (in norm) to the full gradient, which is overkill for controlling the error in the term .
In the following sections, we refine the batch-size strategy to be less aggressive and provide heuristics to achieve these conditions.
4.4 Batchsize strategy
4.5 Gradient subsampling
We propose to build upon the strategy introduced in (Bollapragada2017AdaptiveSS) in the context of first-order methods. Their inner product test determines a sample size such that the search direction is a descent direction with high probability. To control the term , we instead propose choosing a sample size that satisfies the following new test, which we call the acute-angle test:
(6) 
Geometrically, this ensures that the normalized sampled gradient is close to its projection on the true normalized gradient (see Figures 1, 2).
The left-hand side is difficult to compute, and computing can be prohibitively expensive, but it can be approximated using a running average of the sampled gradient. We therefore propose the following approximate version of the test (using , for ):
(7) 
We refer to (6) as the exact acute-angle test and (7) as the approximate acute-angle test. The approximate test serves as an update criterion for the sample size. In other words, if the condition above is not satisfied, we take as the new batch size:
(8) 
We can show, using Markov’s inequality, that if (6) is satisfied, then with probability at least , we have:
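The gradient-subsampling procedure above can be sketched in code. The exact quantities in tests (6)–(7) are elided in this extraction, so the functions below are illustrative placeholders that follow the pattern described: estimate a variance-type quantity that shrinks like 1/|S| against the running-average proxy for the true gradient, and grow the batch when the test fails. All names and thresholds are assumptions, not the paper's exact formulas.

```python
import numpy as np

def approx_acute_angle_test(per_sample_grads, v_run, theta):
    """Sketch of an approximate acute-angle-style test in the spirit of (7).
    v_run is a running average of past sampled gradients standing in for
    the (unavailable) true gradient; quantities are illustrative."""
    n = len(per_sample_grads)
    g = per_sample_grads.mean(axis=0)            # sampled (minibatch) gradient
    v = v_run / (np.linalg.norm(v_run) + 1e-12)  # normalized proxy direction
    # per-sample component orthogonal to the proxy direction
    proj = (per_sample_grads @ v)[:, None] * v[None, :]
    orth = per_sample_grads - proj
    var_est = orth.var(axis=0).sum() / n         # shrinks like 1/|S|
    thresh = theta ** 2 * np.dot(g, v) ** 2
    return var_est <= thresh, var_est, thresh

def grow_batch(size, var_est, thresh):
    """Batch-size update in the spirit of (8): scale |S| so the 1/|S|
    variance estimate drops below the threshold."""
    if var_est <= thresh:
        return size
    return int(np.ceil(size * var_est / max(thresh, 1e-12)))
```

When all per-sample gradients align with the running average, the orthogonal variance is zero and the test passes; when it fails, the batch grows in proportion to the observed excess variance.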
4.6 Hessian subsampling
Computing the Hessian subsampling size described in Lemma 2 requires knowing for a given , which can be difficult to ensure in practice for complex models such as DNNs, where the condition number is hard to compute and where the above bound can lead to very large batch sizes. Instead of computing the Hessian subsampling set such that the concentration inequality (5) holds, we use the sample set employed to compute the stochastic gradient and choose such that:
(9) 
Inequality (9) is less restrictive than (5), where precision on the curvature is required along every direction. Using Markov’s inequality and approximation ideas similar to those introduced in (Bollapragada2017AdaptiveSS), we can derive an approximate test and an update formula for such that (9) is satisfied:
(10) 
(11) 
with for , where the expectation is conditioned on . The intuitive interpretation of this technique is that we penalize the step size by a factor such that encodes how accurately the subsampled Hessian preserves the curvature information along compared to the full Hessian.
This approximate batch-size update approach is given in Algorithm 1.
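The Hessian-side test can be sketched analogously. The exact form of (10)–(11) is elided in this extraction; the function below is an illustrative placeholder following the description above: check that the 1/|S| variance of the per-sample curvatures along the step direction is small relative to their mean, and compute a step-size penalty factor close to 1 when the subsampled curvature is reliable. Names and the penalty formula are assumptions.

```python
import numpy as np

def hessian_subsampling_test(per_sample_curvatures, eps_h):
    """Sketch of an approximate Hessian test in the spirit of (10).
    per_sample_curvatures[i] holds d^T H_i d, the curvature of sample i
    along the step direction d (illustrative placeholder quantities)."""
    n = len(per_sample_curvatures)
    mean_curv = per_sample_curvatures.mean()
    var_est = per_sample_curvatures.var() / n     # shrinks like 1/|S|
    passes = var_est <= (eps_h * mean_curv) ** 2
    # step-size penalty factor as described in the text: near 1 when the
    # subsampled curvature is reliable, smaller otherwise (assumed form)
    gamma = 1.0 / (1.0 + np.sqrt(var_est) / max(abs(mean_curv), 1e-12))
    return passes, gamma
```

With identical per-sample curvatures the variance vanishes, the test passes, and the penalty factor equals 1, leaving the adaptive step size untouched.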
4.7 Dynamic Sampling AdaptiveSGD
Based on the update criteria developed above we propose Algorithm 2.
4.8 Convergence theorem:
In this section, we state the main lemma and theorem which prove global convergence with high probability of Algorithm 2.
The acute-angle test, in conjunction with Hessian subsampling, allows the algorithm to make sufficient progress with high probability at every iteration if is chosen sufficiently small.
Lemma 4.
Suppose that satisfies Assumptions 1–4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (6) and the Hessian subsampling test (10) are satisfied at each iteration for given constants . Then, for all , starting from any ,
with probability at least . Moreover, with probability , we have
where
The proof of the above lemma follows the nonasymptotic probabilistic analysis of subsampled Newton methods in (RoostaKhorasani2019), which allows a small, yet nonzero, probability of “bad events” in each iteration. Hence, the cumulative probability of “good events” decreases with each iteration. Although the term “convergence” typically refers to the asymptotic limit of an infinite sequence, we analyse here the nonasymptotic behavior of a finite number of random iterates and provide probabilistic results about their properties. Our main convergence theorem is:
Theorem 2.
Suppose that satisfies Assumptions 1–4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (6) and the Hessian subsampling test (10) are satisfied at each iteration for given constants . Then, for all , starting from any , with probability at least , we have
where
In order to obtain a nonzero probability of asymptotic convergence, we can decrease the probability p of failure at a rate faster than , and the above result will hold asymptotically with nonzero probability, at the expense of a more aggressive batch-size increase strategy.
5 Scaleinvariant version: AdaADAM
A number of popular methods, especially in deep learning, choose per-element update magnitudes based on past gradient observations. In this section, we adapt the elementwise update-magnitude aspect of the ADAM algorithm to our adaptive framework, yielding a new algorithm that enjoys more stability and reduces the variance arising from the stochasticity of the gradient and curvature estimates. The framework is general enough to be combined with other variants of SGD, such as RMSPROP, ADADELTA and ADAGRAD, as well as second-order methods such as Block BFGS (Gower2016StochasticBB), (Gao2016BlockBM).
The ADAM optimizer maintains moving averages of stochastic gradients and their elementwise squares:
(12)  
(13) 
with , and updates
Since our framework can be applied to any direction , we propose the following adaptive-learning-rate version of ADAM:
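The paper's exact update is elided in this extraction; the sketch below combines the standard bias-corrected ADAM moments (12)–(13) from (Kingma2014AdamAM) with a curvature-based base step along the resulting direction. The `adaptive_step` formula is an illustrative assumption, not necessarily the paper's rule.

```python
import numpy as np

def adam_direction(m, v, g, k, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard ADAM moving averages with bias correction, returning the
    elementwise-scaled search direction d_k (standard ADAM; only the base
    step size is made adaptive in this framework)."""
    m = beta1 * m + (1 - beta1) * g          # first moment, cf. (12)
    v = beta2 * v + (1 - beta2) * g * g      # second moment, cf. (13)
    m_hat = m / (1 - beta1 ** k)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** k)
    d = -m_hat / (np.sqrt(v_hat) + eps)
    return m, v, d

def adaptive_step(d, hess_vec_prod):
    """Illustrative curvature-based base step: inverse of d^T H d along d,
    guarded against non-positive curvature (assumed form)."""
    curv = float(d @ hess_vec_prod(d))
    return 1.0 / curv if curv > 0 else None  # caller falls back to a median step
```

The direction computation is unchanged from ADAM; only the scalar multiplying it is chosen from the curvature along that direction, which is what makes the scheme compatible with the other SGD variants listed above.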
6 Practical considerations
6.1 Hessian vector product
In our framework, we need to compute the curvature along a direction , i.e., . Hence we need to efficiently compute the Hessian-vector product . Fortunately, for functions that can be evaluated via a computational graph (logistic regression, DNNs, etc.), there are automatic methods for computing Hessian-vector products exactly (Pearlmutter), which take about as much computation as a gradient evaluation. Hence, if is a minibatch stochastic gradient, can be computed with essentially the same effort as that needed to compute . The method described in (Pearlmutter) is based on the differential operator:
Since and , to compute , Pearlmutter applies the operator to the backpropagation equations used to compute .
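The quantity Pearlmutter's R-operator computes exactly is the directional derivative of the gradient. A minimal numpy sketch of that identity, using a finite-difference surrogate in place of the exact autodiff implementation (which frameworks such as PyTorch provide via double backward):

```python
import numpy as np

def hvp_finite_diff(grad_fn, w, v, r=1e-5):
    """Approximates the Hessian-vector product H(w) v via
        H v = lim_{r->0} [grad(w + r v) - grad(w - r v)] / (2 r),
    the directional derivative that the R-operator evaluates exactly.
    Two gradient calls, and the full Hessian is never formed."""
    return (grad_fn(w + r * v) - grad_fn(w - r * v)) / (2 * r)

# sanity check on a quadratic f(w) = 0.5 w^T A w, whose Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
w0 = np.array([1.0, -1.0])
v0 = np.array([0.5, 2.0])
hv = hvp_finite_diff(grad, w0, v0)   # matches A @ v0 up to rounding
```

The curvature along a direction d needed by the adaptive step is then the scalar `d @ hvp(d)`, at the cost of one extra gradient-like pass.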
6.2 Individual gradients in Neural Network setting
Unfortunately, the differentiation functionality provided by most software frameworks (TensorFlow, PyTorch, etc.) does not support computing gradients with respect to individual samples in a minibatch, making it expensive to apply the adaptive batch-size framework in that setting. On the other hand, it is usually impractical to substantially increase the batch size due to memory limitations. As a result, in the neural network setting we only apply the adaptive learning rate part of our framework, at fixed milestone epochs for a small number of iterations (similar to the learning rate schedulers widely used in practice), and we fix the batch size between milestones; thus the problem of per-sample minibatch computation does not arise.
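The milestone scheme described above can be sketched as a plain schedule: probe the adaptive step for a few iterations at each milestone, then freeze a summary of those steps until the next milestone. `adaptive_step_fn` is a placeholder for the framework's per-iteration step computation; using the median of the probed steps follows the hybrid variant discussed later in the experiments.

```python
from statistics import median

def milestone_lr_schedule(milestones, probe_iters, adaptive_step_fn, total_iters):
    """At each milestone iteration, run the adaptive step-size computation
    for probe_iters iterations, then hold the median of those steps as a
    constant learning rate until the next milestone (sketch)."""
    lrs, current_lr = [], None
    for k in range(total_iters):
        if k in milestones:
            probe = [adaptive_step_fn(k + i) for i in range(probe_iters)]
            current_lr = median(probe)
        lrs.append(current_lr)
    return lrs
```

Between milestones the schedule behaves exactly like a conventional fixed-rate scheduler, so no per-sample gradient computation is needed there.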
7 Numerical Experiments
In this section, we present the results of numerical experiments comparing our AdaptiveSGD method to vanilla SGD with an optimized learning rate, SGD with the inner-product test, SGD with the augmented inner-product test, and SGD with the norm test, to demonstrate the ability of our framework to capture the "best" learning rate without prior tuning, as well as its effectiveness and cost against other progressive sampling methods. In all of our experiments, we fixed ; i.e., we wanted the decrease guarantee to hold with probability and , where is the angle between the gradient and the sampled gradient at iteration . We also decreased by a factor of 0.9 every 10 iterations to ensure that the cumulative probability does not converge to 0. These parameters worked well in our experiments; of course, one could choose more restrictive parameters, but they would result in a more aggressive batch-size increase. The batch size used for vanilla SGD and ADAM was fixed at 128 in our neural network experiments.
7.1 Datasets
For binary classification using logistic regression, we chose 6 data sets from LIBSVM (libsvm) with a variety of dimensions and sample sizes, which are listed in Table (1).
Data Set | Data Points | Dimension
covertype | 581012 | 54
webspam | 350000 | 254
rcv1-train.binary | 47237 | 20242
real-sim | 20959 | 72309
a1a | 30956 | 123
ionosphere | 351 | 34
For our DNN experiments, we experimentally validated the ability of the adaptive frameworks AdaSGD and AdaADAM to determine a good learning rate without prior fine-tuning on a two-layer convolutional neural network with ReLU activation and batch normalization, and on ResNet18
(resnetpaper), using the MNIST dataset (mnist) and the CIFAR10 dataset (cifar).
7.2 Binary classification: logistic regression
We considered binary classification problems where the objective function is given by the logistic loss with regularization, with :
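The regularized logistic loss used here (constants stripped in this extraction) takes the standard form below; the sketch assumes labels in {-1, +1} and an L2 penalty, which may differ from the paper's exact scaling.

```python
import numpy as np

def logistic_loss_grad(w, X, y, lam):
    """L2-regularized logistic loss and gradient (standard formulation):
        F(w) = (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2,
    with y_i in {-1, +1}."""
    n = X.shape[0]
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (w @ w)
    # each sample contributes -sigmoid(-margin) * y_i * x_i to the gradient
    coeff = -y / (1.0 + np.exp(margins))
    grad = X.T @ coeff / n + lam * w
    return loss, grad
```

Per-sample gradients (the columns `coeff[i] * X[i]`) are cheap here, which is what makes the adaptive batch-size tests practical in the logistic regression experiments, in contrast to the DNN setting discussed in Section 6.2.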
7.2.1 Comparison of different batchincreasing strategies
Numerical results on Covertype and Webspam datasets are reported in Figures 3, 4. For results on other datasets mentioned in Table 1, please refer to the supplementary materials.
We observe that AdaSGD successfully determines adequate learning rates and outperforms vanilla SGD with an optimized learning rate, SGD with the inner-product test, SGD with the augmented inner-product test, and SGD with the norm test, as shown in Figure 3. We also observe that the batch size increases gradually and stays relatively small compared to the other methods.
In Figure 4, we observe that AdaSGD successfully determines adequate learning rates and outperforms vanilla SGD with various learning rates. One interesting observation is that the learning rates determined by the AdaSGD algorithm oscillate around the value 2, which turned out to be the optimal learning rate for vanilla SGD. The Eta subfigure in Figure 4 refers to , which quantifies the decrease guarantee in our framework and should converge to zero, since the objective function is bounded below.
The above observations about the learning rates computed by our adaptive approach on logistic regression motivated us to introduce a hybrid version of our framework, in which we run the adaptive scheme for a relatively small number of iterations at the beginning of certain milestone epochs and then use the median of the adaptive learning rates as a constant learning rate until the next milestone (similar to the learning rate schedulers widely used in practice).
Another interesting observation is that AdaptiveSGD sometimes takes large steps but manages to stay stable, in contrast to vanilla SGD with a large step size. This is mainly due to both the adaptive step computation based on local curvature and the batch-increasing strategy, which reduces the stochasticity of the gradient.
7.3 Deep neural network results
Although the loss functions of neural network models are generally nonconvex, we were motivated to apply our framework to training deep learning models by the recent experimental observations in (hessiannn). That work numerically studied the spectral density of the Hessian of deep neural network loss functions throughout the optimization process and observed that the negative eigenvalues tend to disappear rather rapidly during optimization. In our case, we are only interested in the curvature along the gradient direction. Surprisingly, we found that negative curvature was rarely encountered at the current iterate along the step direction. Hence, when a negative value was encountered, we simply took the median of the step sizes over the most recent K steps as a heuristic; in our experiments we set K = 20.
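The median-of-recent-steps fallback described above is a few lines of bookkeeping. The sketch below uses an illustrative Newton-like step formula and an assumed default for the cold-start case; only the fallback logic (median over the last K = 20 accepted steps) follows the text.

```python
from collections import deque
from statistics import median

class StepSizeHistory:
    """Fallback for negative curvature along the step direction: replace the
    (undefined) curvature-based step with the median of the most recent
    K accepted step sizes (K = 20 in the experiments)."""

    def __init__(self, K=20, default=0.1):
        self.hist = deque(maxlen=K)   # keeps only the last K accepted steps
        self.default = default        # cold-start value (assumption)

    def step_size(self, curvature_along_d, grad_norm_sq):
        if curvature_along_d > 0:
            t = grad_norm_sq / curvature_along_d  # illustrative step formula
            self.hist.append(t)
            return t
        return median(self.hist) if self.hist else self.default
```

Because the deque is bounded, the heuristic tracks the recent step-size scale rather than the whole run, so an early transient cannot dominate the fallback value.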
The AdaptiveSGD method successfully determined near-optimal learning rates (see Figure 5) and outperformed vanilla SGD with learning rates in the range of the chosen adaptive learning rates. The CNN we used was a shallow two-layer convolutional neural network with ReLU activation and batch normalization, followed by a fully connected layer.
We also experimented with a larger-scale model, ResNet18 on the CIFAR10 dataset, where we used the adaptive framework on SGD with momentum and on ADAM to automatically tune the learning rate at the beginning of epochs 1, 25, 50, …, 150 (using the first 20 minibatch iterations to roughly estimate a good constant learning rate between milestones). This method can be seen as a substitute for the common approach of using a learning rate schedule to train large neural networks such as ResNet, DenseNet, etc., where the user typically supplies an initial learning rate, milestone epochs, and a decay factor. We tested our framework against the hyperparameters used in (resnetpaper), and our adaptive SGD with momentum achieved roughly the same performance without any hyperparameter tuning. We believe this tool will be useful for selecting good learning rates. Results are presented in Figure 6. Additional numerical results obtained on the CIFAR10 dataset using the VGG11 (VGGNet) DNN are presented in Appendix B.2.
8 Conclusion
We presented an adaptive framework for stochastic optimization that we believe is a valuable tool for fast and practical learning rate tuning, especially in deep learning applications, and which can also be combined with popular variants of SGD such as ADAM, AdaGrad, etc. Studying the theoretical convergence guarantees of our method for DNNs, whose loss functions are nonconvex, and the convergence of our adaptive framework combined with other variants of SGD are interesting avenues for future research.
Acknowledgements
We thank Wenbo Gao and Chaoxu Zhou for helpful discussions about the convergence results. Partial support for this research has been provided by NSF Grant CCF 1838061.
References
Appendix Appendix A Proofs
Notation:


.

.

.

.

; .

.

.
Theorem 1 Let . Suppose that satisfies Assumptions 1–4 and that we have an efficient way to compute , where . Let be the set of iterates and sampled Hessian and gradient batches generated by taking the step at iteration , starting from any , where are chosen such that Lemma 2 and Lemma 3 are satisfied at each iteration. Then, for any , with probability we have:
where
Proof.
By Assumption 3, choosing ensures that the conditions of Lemma 3 are satisfied. Therefore, with probability , we have:
Lemma 3  
Also:
Hence, with probability :
(14) 
Let ; using Lemma 1, we have:
(15) 
If we sample the Hessian at the rate described in Lemma 2, the following inequality is satisfied with probability :
Taking results in the following inequalities being satisfied with probability :
(16) 
We have, for :
Using the monotonicity of , (14) and the choice of step size as , we have:
(17) 
The choice of is possible because, as we now show, . Let ; clearly is decreasing as a function of . Using inequality (14), we have
We now obtain a decrease guarantee for the function with probability :
Now, using the Cauchy-Schwarz inequality, inequalities (14), and Assumptions 1–3, we have:
By observing that satisfies for all , with probability , we have:
with . Using Assumption 2, we have . Therefore, we have:
Using (14), we have with probability :
(18) 
It is well known (bertsekas2003convex) that for strongly convex functions:
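The strong-convexity bound invoked here is the standard one; in generic notation (the paper's own symbols are elided in this extraction), for a $\mu$-strongly convex $F$ with minimizer $w^*$:

```latex
F(w) - F(w^*) \;\le\; \frac{1}{2\mu}\,\bigl\lVert \nabla F(w) \bigr\rVert^2 .
```

Substituting this bound for the gradient-norm term in (18) is what converts the per-iteration decrease into the linear contraction of the optimality gap stated in the theorem.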
Substituting this in (18) and subtracting from both sides, we obtain with probability :
with :
which is the desired result. ∎
Lemma 4. Suppose that satisfies Assumptions 1–4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (6) and the Hessian subsampling test (10) are satisfied at each iteration for given constants . Then, for all , starting from any ,
with probability at least . Moreover, with probability , we have
where
Proof.
Since the (exact) acute-angle test (6) is satisfied, applying Markov’s inequality to the positive random variable