The problem of interest is minimizing functions of the following form:
common to problems in statistics and machine learning. For instance, in the empirical risk minimization framework, a model is learned from a set
of training data by minimizing an empirical loss function of the form
where we define , and is the composition of a prediction function (parametrized by ) and a loss function, and
are random input-output pairs with the uniform discrete probability distribution. An objective function of the form (1) is often impractical, as the distribution of is generally unavailable, making it infeasible to analytically compute
. This can be resolved by replacing the expectation by the estimate (2). The strong law of large numbers implies that the sample mean in (2) converges almost surely to (1) as the number of samples increases. However, in practice, even problem (2) is not tractable for classical optimization algorithms, as the amount of data is usually extremely large. A better strategy when optimizing (2) is to consider sub-samples of the data to reduce the computational cost. This leads to stochastic algorithms in which the objective function changes at each iteration through the random selection of sub-samples.
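As a concrete illustration of the gap between the full objective (2) and its sub-sampled surrogates, the following sketch (a toy least-squares example of our own, not one of the paper's experiments) compares the full empirical risk with a cheap mini-batch estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: n input-output pairs, squared loss per sample.
n, d = 1000, 5
X = rng.standard_normal((n, d))
w_ref = rng.standard_normal(d)
y = X @ w_ref + 0.1 * rng.standard_normal(n)

def empirical_risk(w, idx=None):
    """Average loss over the full data set, or over a sub-sample `idx`."""
    if idx is None:
        idx = np.arange(n)
    residuals = X[idx] @ w - y[idx]
    return 0.5 * np.mean(residuals ** 2)

w = np.zeros(d)
full = empirical_risk(w)                      # objective (2): full sample mean
batch = empirical_risk(w, rng.choice(n, 64))  # cheap mini-batch estimate
```

The mini-batch estimate costs a fraction of the full pass; the sampling strategies below control how large that fraction must be.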
2 Related work
Stochastic gradient descent (SGD) (vanillasgd), using either single samples or mini-batches, is widely used; however, it presents several challenges in terms of convergence speed. Hence, the topic of learning-rate tuning has been extensively investigated. Modified versions of SGD based on the use of momentum terms (Polyak64); (NESTEROV83) have been shown to speed up the convergence of SGD on smooth convex functions. In DNNs, other popular variants of SGD scale the individual components of the stochastic gradient using past gradient observations in order to deal with differences in the magnitudes of the stochastic gradient components (especially between layers). Among these are ADAGRAD (Duchi2011), RMSPROP (Tieleman2012), ADADELTA (zeiler2012), and ADAM (Kingma2014AdamAM), as well as the structured matrix scaling version SHAMPOO (Gupta2018ShampooPS) of ADAGRAD. These methods are scale-invariant but do not avoid the need for prior tuning of the base learning rate.
A large family of algorithms exploits a second-order approximation of the objective function to better capture its local curvature and avoid the manual choice of a learning rate. Stochastic L-BFGS methods have recently been studied in (zhougaodon) and (bollapragadaBFGS). Although fast in theory (in terms of the number of iterations), they are more expensive in terms of iteration cost and memory requirements; thus they introduce a considerable overhead, which usually makes these approaches less attractive than first-order methods for modern deep learning architectures. Other Hessian-approximation-based methods, based on the Gauss-Newton and natural gradient methods, have also been proposed in (hessianfree), (MartensG15), (tonga), (naturalg). However, pure SGD is still of interest as it often has better generalization properties (sgdgen).
3 Our contributions
- A new stochastic optimization method is proposed that uses an adaptive step size that depends on the local curvature of the objective function, eliminating the need for the user to perform prior learning rate tuning or a line search.
- The adaptive step size is computed efficiently using a Hessian-vector product (without computing the full Hessian (Pearlmutter)) that has the same cost as a gradient computation. Hence, the resulting method can be applied to train DNNs using backpropagation.
- A dynamic batch sampling strategy is proposed that increases the batch size progressively, using ideas from (Bollapragada2017AdaptiveSS) but based upon a new "acute-angle" test. In contrast, the approach in (zhougaodon) uses empirical-process concentration inequalities, resulting in worst-case lower bounds on batch sizes that increase rapidly to the full batch.
- Our Adaptive framework can also be combined with other variants of SGD (ADAM, ADADELTA, ADAGRAD, etc.). We present an adaptive version of ADAM and show numerically that the resulting method is able to select excellent base learning rates.
- An interesting empirical observation in our DNN experiments was that, using our Adaptive framework on a convolutional neural network and a ResNet, we never encountered negative local curvature along the step direction.
4 A Dynamic Sampling Adaptive-SGD Method
4.1 Definitions and notation
We propose an iterative method of the following form. At the k-th iteration, we draw i.i.d. samples and define the empirical objective function
An approximation to the gradient of can be obtained by sampling. Let denote the set of samples used to estimate the Hessian, and let
An approximation to the Hessian of can also be obtained by sampling. At the iterate , we define
A first-order method based on these approximations is then given by
where, in our approach, is the adaptive step size, and the sets of samples and that are used to estimate the gradient and Hessian adaptively grow as needed, and are drawn independently. In the following sections, we will describe how we choose the step size and the batch sizes and . We use the following notation:
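A minimal sketch of an update of the form (3) on a toy least-squares problem. For illustration we fix the batch sizes (they grow adaptively in the method of this paper) and substitute an exact line-minimization (Cauchy) step along the sampled gradient for the self-concordant step-size rule defined later; all names and the fallback constant are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

# Least-squares toy problem: per-sample losses 0.5 * (a_i^T w - b_i)^2.
n, d = 500, 4
A = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
b = A @ w_star + 0.1 * rng.standard_normal(n)

def loss(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def sampled_grad(w, idx):
    return A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)

def sampled_hvp(w, v, idx):
    return A[idx].T @ (A[idx] @ v) / len(idx)

w = np.zeros(d)
bg, bh = 32, 32                      # gradient / Hessian batch sizes (fixed here)
for k in range(200):
    Sg = rng.choice(n, bg)           # gradient sample set
    Sh = rng.choice(n, bh)           # independently drawn Hessian sample set
    g = sampled_grad(w, Sg)
    curv = g @ sampled_hvp(w, g, Sh)            # sampled curvature along g
    t = (g @ g) / curv if curv > 0 else 1e-2    # stand-in adaptive step-size rule
    w = w - t * g
```

The key structural points of (3) are visible: the two sample sets are drawn independently, and the step size is computed from a single Hessian-vector product rather than a full Hessian.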
A convex function is self-concordant if there exists a constant such that for every and every , we have:
is standard self-concordant if the above is satisfied for .
Self-concordant functions were introduced by Nesterov and Nemirovski in the context of interior-point methods (Nesterov1994InteriorpointPM). Many problems in machine learning have self-concordant formulations: In (zhangb15disco) and (Bach2009SelfconcordantAF), it is shown that regularized regression, with either logistic loss or hinge loss, is self-concordant.
We first give the intuition behind the sampling strategies and step-size used in our method. The formal proof of the convergence of our algorithm will be given in the Appendix. Throughout this section, our theoretical analysis is based on the following technical assumptions on :
Hessian regularity: such that:
The condition number, denoted is given by .
Uniform Sampled Hessian regularity: For the above defined , for every sample set and , we have:
Gradient regularity: We assume that the norm of the gradient of stays bounded above during the optimization procedure:
Function regularity: is standard self-concordant.
Consider the iterative method for minimizing self-concordant functions described in (3) using the step-size , where and .
Methods of this type have been analyzed in (TranDinh2015CompositeCM) and (gaogoldself) in the deterministic setting. In the latter paper, the above choice of is shown to guarantee a decrease in the function value.
(Lemma 4.1, gaogoldself)
For standard self-concordant, for all :
Let . Then the above statement is equivalent to:
If , then:
Since we do not want to compute the full Hessian at , we cannot compute and (which is the true value of the argmax of ). Instead, we use their respective estimates using a sub-sampled Hessian:
Consequently, the update step used in practice will be:
We will initially study the behaviour of the estimated values , using the framework described in (Roosta-Khorasani2019), which shows that, for gradient and Hessian sub-sampling, using random matrix concentration inequalities, one can sub-sample in such a way that first- and second-order information, i.e., curvature, is well estimated with probability of at least , where can be chosen arbitrarily.
(Lemma 2 in (Roosta-Khorasani2019)) Under Assumptions 1 and 2, given any , if we have:
where is the operator norm induced by the Euclidean norm.
(Lemma 3 in (Roosta-Khorasani2019)) If for any and any and , then
If we sample the Hessian at the rate described in Lemma 2, the following inequalities are satisfied with probability of at least
This allows us to estimate the curvature along an arbitrary direction by sub-sampling the local norm squared, , with precision . This, combined with Lemma 1 and the use of the step size , guarantees, with probability of at least , a decrease in :
Using the gradient sampling rate in Lemma 3, we can control the stochasticity of the rightmost term above to obtain
Let . Suppose that satisfies the Assumptions 1-4 and that we have an efficient way to compute where . Let be the set of iterates, and sample Hessian and gradient batches generated by taking the step at iteration starting from any , where are chosen such that Lemma 3 and Lemma 2 are satisfied at each iteration. Then, for any , with probability we have :
As pointed out in (Roosta-Khorasani2019), for many generalized linear models (which are also self-concordant), the upper bounds and are known. However, computing these values in other settings (e.g., DNNs) can be a challenging task. On the other hand, the above method often leads to a large batch size, since we are requiring very strong guarantees on the quality of the estimates. More specifically, we require the estimated curvature along any direction (not just the descent direction) to be close to the true curvature, as well as the sampled gradient to be close (in norm) to the full gradient, which is overkill for controlling the error in the term .
In the following sections, we will refine the batch-size strategy to be less aggressive and we will provide heuristics to achieve such conditions.
4.4 Batch-size strategy
4.5 Gradient sub-sampling
We propose to build upon the strategy introduced by (Bollapragada2017AdaptiveSS) in the context of first-order methods. Their inner product test determines a sample size such that the search direction is a descent direction with high probability. To control the term , we propose, instead, to choose a sample size that satisfies the following new test, which we call the acute-angle test:
The left-hand side is difficult to compute, and the computation of can be prohibitively expensive, but it can be approximated using a running average of the sampled gradient. We therefore propose the following approximate version of the test (using , for ):
We refer to (4.5) as the exact acute-angle test and to (7) as the approximate acute-angle test. The approximate test will serve as an update criterion for the sample size. In other words, if the condition above is not satisfied, we take as the new batch size:
We can show, using Markov’s inequality, that if (4.5) is satisfied, then with probability at least , we have:
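The update logic can be sketched as follows. Since the exact expressions of (7)-(8) involve symbols elided above, the test below is a hypothetical Bollapragada-style variance form of the approximate acute-angle test; the function name and the constant `theta` are ours:

```python
import numpy as np

def acute_angle_batch_update(per_sample_grads, g_run, batch_size, theta=0.9):
    """Approximate acute-angle test (hypothetical form): compare the sample
    variance of the components of the per-sample gradients orthogonal to the
    running-average gradient g_run against theta**2 * ||g_run||**2, and grow
    the batch size when the variance term is too large."""
    unit = g_run / np.linalg.norm(g_run)
    # Components of each per-sample gradient orthogonal to the running average.
    ortho = per_sample_grads - np.outer(per_sample_grads @ unit, unit)
    var = np.mean(np.sum(ortho ** 2, axis=1))
    if var / batch_size <= theta ** 2 * (g_run @ g_run):
        return batch_size                                   # test passed
    return int(np.ceil(var / (theta ** 2 * (g_run @ g_run))))  # enlarge batch
```

When the per-sample gradients all point close to the running average, the test passes and the batch size is left unchanged; large orthogonal scatter triggers a proportional increase.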
4.6 Hessian sub-sampling
Computing the Hessian sub-sampling size described in Lemma 2 requires , for given , which can be difficult in practice in complex models such as DNNs, where the condition number is hard to compute and where the above bound could lead to very large batch sizes. Instead of computing the Hessian sub-sampling set such that the concentration inequality (5) holds, we use the sample set used to compute the stochastic gradient, and we choose such that:
Inequality (9) is less restrictive than (5), where precision on the curvature is required along any arbitrary direction. Using Markov's inequality and approximation ideas similar to those introduced by (Bollapragada2017AdaptiveSS), we can derive an approximate version of the test and an update formula for such that (9) is satisfied:
with for , where the expectation is conditioned on . The intuitive interpretation of this technique is that we penalize the step size with a factor such that encodes how accurately the sub-sampled Hessian preserves the curvature information along , compared to the full Hessian.
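In the same spirit as the gradient test, the Hessian batch update can be sketched with a hypothetical variance test on the per-sample curvatures along the step direction (the exact form of test (10) involves elided symbols; the function name and constant `nu` are ours):

```python
import numpy as np

def hessian_batch_update(per_sample_curvs, batch_size, nu=0.5):
    """Approximate Hessian sub-sampling test (hypothetical form): the
    per-sample curvatures d^T H_i d along the step direction should
    concentrate around their mean; if their sample variance, scaled by the
    batch size, exceeds (nu * mean)**2, grow the Hessian batch accordingly."""
    mean_c = per_sample_curvs.mean()
    var_c = per_sample_curvs.var(ddof=1)
    if var_c / batch_size <= (nu * mean_c) ** 2:
        return batch_size                               # curvature well estimated
    return int(np.ceil(var_c / (nu * mean_c) ** 2))     # enlarge Hessian batch
```

Note that only curvature along the current direction is controlled, which is exactly why this test is cheaper than the operator-norm condition (5).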
This approximate approach to updating the batch size is given in Algorithm 1.
4.7 Dynamic Sampling Adaptive-SGD
Based on the update criteria developed above we propose Algorithm 2.
4.8 Convergence theorem:
In this section, we state the main lemma and theorem which prove global convergence with high probability of Algorithm 2.
The acute-angle test, in conjunction with the Hessian sub-sampling test, allows the algorithm to make sufficient progress with high probability at every iteration if is chosen very small.
Suppose that satisfies Assumptions 1-4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (4.5) and the Hessian sub-sampling test (10) are satisfied at each iteration for any given constants . Then, for all , starting from any ,
with probability at least . Moreover, with probability , we have
The proof of the above lemma follows the non-asymptotic probabilistic analysis of sub-sampled Newton methods in (Roosta-Khorasani2019), which allows for a small, yet non-zero, probability of occurrence of “bad events” in each iteration. Hence, the cumulative probability of occurrence of “good events” decreases with each iteration. Although the term “convergence” typically refers to the asymptotic limit of an infinite sequence, we analyse here the non-asymptotic behavior of a finite number of random iterates and provide probabilistic results about their properties. Our main convergence theorem is:
Suppose that satisfies Assumptions 1-4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (4.5) and the Hessian sub-sampling test (10) are satisfied at each iteration for any given constants . Then, for all , starting from any , with probability of at least , we have
In order to obtain a non-zero probability of asymptotic convergence, we can decrease the probability p of failure at a rate faster than , and the above result will hold asymptotically with non-zero probability, at the expense of a more aggressive batch-size-increasing strategy.
5 Scale-invariant version: Ada-ADAM
A number of popular methods, especially in deep learning, choose per-element update magnitudes based on past gradient observations. In this section, we adapt the element-wise magnitude-update aspect of the ADAM algorithm to our adaptive framework, which leads to a new algorithm that enjoys more stability and reduces the variance arising from the stochasticity of the gradient and curvature estimates. The framework is general enough to be combined with other variants of SGD, such as RMSPROP, ADADELTA, and ADAGRAD, as well as second-order methods such as Block BFGS (Gower2016StochasticBB), (Gao2016BlockBM).
The ADAM optimizer maintains moving averages of stochastic gradients and their element-wise squares:
with , and updates
Since our framework can be applied to any direction , we propose the following adaptive-learning-rate version of ADAM:
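A sketch of one Ada-ADAM-style iteration: the ADAM direction is formed from the moment estimates, and the base learning rate is replaced by an adaptive step along that direction. Here the step-size rule is the exact line-minimization step for a quadratic model, standing in for the paper's self-concordant rule; `hvp` is an assumed curvature oracle and all names are illustrative:

```python
import numpy as np

def ada_adam_step(w, g, hvp, m, v, k, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Ada-ADAM-style step (illustrative sketch): form the ADAM direction
    from the moment estimates, then scale it by an adaptive step size based on
    the sampled curvature along that direction, instead of a fixed base rate."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** k)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** k)                 # bias-corrected second moment
    d = m_hat / (np.sqrt(v_hat) + eps)           # ADAM direction
    curv = d @ hvp(w, d)                         # curvature along d via one HVP
    t = (g @ d) / curv if curv > 0 else 1e-3     # stand-in adaptive step size
    return w - t * d, m, v
```

The element-wise scaling of ADAM is kept intact; only the scalar step size along the resulting direction is chosen adaptively.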
6 Practical considerations
6.1 Hessian vector product
In our framework, we need to compute the curvature along a direction given by : . Hence we need to efficiently compute the Hessian-vector product . Fortunately, for functions that can be computed using a computational graph (Logistic regression, DNNs, etc) there are automatic methods available for computing Hessian-vector products exactly (Pearlmutter), which take about as much computation as a gradient evaluation. Hence, if is a mini-batch stochastic gradient, can be computed with essentially the same effort as that needed to compute . The method described in (Pearlmutter) is based on the differential operator:
Since and , to compute , Pearlmutter applies to the back-propagation equations used to compute .
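The identity underlying Pearlmutter's trick is that the Hessian-vector product is the directional derivative of the gradient along the given vector. The following sketch approximates that derivative by central finite differences and checks it on a quadratic; Pearlmutter's R-operator computes the same quantity exactly, at backpropagation-level cost (the finite-difference variant here is only our illustration, not the method used in the paper):

```python
import numpy as np

def hvp_fd(grad_f, x, v, eps=1e-5):
    """Hessian-vector product via the directional-derivative identity
    H(x) v = d/de grad_f(x + e*v) at e = 0, approximated by central
    differences of the gradient."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

# Check on a quadratic f(x) = 0.5 x^T A x, whose Hessian is A.
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M.T @ M                      # symmetric positive semidefinite Hessian
grad_f = lambda x: A @ x
x, v = rng.standard_normal(6), rng.standard_normal(6)
approx = hvp_fd(grad_f, x, v)
exact = A @ v
```

Either way, the cost is dominated by one or two extra gradient evaluations, never by forming the full Hessian.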
6.2 Individual gradients in Neural Network setting
Unfortunately, the differentiation functionality provided by most software frameworks (TensorFlow, PyTorch, etc.) does not support computing gradients with respect to individual samples in a mini-batch, making it expensive to apply the adaptive batch-size framework in that setting. On the other hand, it is usually impractical to substantially increase the batch size due to memory limitations. As a result, in the neural network setting we apply only the adaptive learning-rate part of our framework, at fixed milestone epochs for a small number of iterations (similar to the learning rate schedulers that are widely used in practice), and we fix the batch size between milestones; thus the problem of per-sample mini-batch computation does not arise.
7 Numerical Experiments
In this section, we present the results of numerical experiments comparing our Adaptive-SGD method to vanilla SGD with optimized learning rate, SGD with inner-product test, SGD with augmented inner-product test, and SGD with norm test, to demonstrate the ability of our framework to capture the "best learning rate" without prior tuning, as well as the effectiveness and cost of our method against other progressive sampling methods. In all of our experiments, we fixed ; i.e., we wanted the decrease guarantee to hold with probability and , where is the angle between the gradient and the sampled gradient at iteration . We also decreased by a factor of 0.9 every 10 iterations to ensure that the cumulative probability does not converge to 0. These parameters worked well in our experiments; of course, one could choose more restrictive parameters, but they would result in a more aggressive batch-size increase. The batch size used for vanilla SGD and ADAM was fixed at 128 in our neural network experiments.
For binary classification using logistic regression, we chose 6 data sets from LIBSVM (libsvm) with a variety of dimensions and sample sizes, which are listed in Table (1).
Data Set | Data Points | Dimension
For our DNN experiments, we experimentally validated the ability of the adaptive frameworks Ada-SGD and Ada-ADAM to determine a good learning rate without prior fine-tuning, on a two-layer convolutional neural network with a ReLU activation function and batch normalization, and on ResNet18 (resnetpaper), using the MNIST (mnist) and CIFAR10 (cifar) datasets.
7.2 Binary classification logistic regression
We considered binary classification problems where the objective function is given by the logistic loss with regularization, with:
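For concreteness, a sketch of such an objective with labels in {-1, +1}, together with its gradient and a Hessian-vector product (the regularization constant `lam` and all names are ours; the paper's exact constants are elided above):

```python
import numpy as np

def logistic_objects(X, y, lam):
    """Regularized logistic loss (labels y in {-1, +1}), its gradient, and a
    Hessian-vector product, matching the general form of the objective in this
    section up to the elided constants."""
    n = len(y)
    def loss(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)
    def grad(w):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # sigmoid of the negative margin
        return -(X.T @ (y * s)) / n + lam * w
    def hvp(w, v):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        d = p * (1 - p)                          # diagonal weights of the Hessian
        return X.T @ (d * (X @ v)) / n + lam * v
    return loss, grad, hvp
```

The `hvp` closure is what the adaptive step-size rule needs: curvature along a direction at the cost of two matrix-vector products.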
7.2.1 Comparison of different batch-increasing strategies
We observe that the Ada-SGD successfully determines adequate learning rates and outperforms the vanilla SGD with optimized learning rate, SGD with inner-product test, SGD with augmented inner-product test, SGD with norm test, as shown in Figure 3. We also observe that the batch-size increases gradually and stays relatively small compared to the other methods.
In Figure 4, we observe that the Ada-SGD successfully determines adequate learning rates and outperforms the vanilla SGD with different learning rates. One interesting observation is that the learning rates determined by the Ada-SGD algorithm oscillate around the learning rate with value 2 which turned out to be the optimal learning rate for the vanilla SGD. The Eta sub-figure in Figure 4 refers to , which quantifies the decrease guarantee in our framework and should converge to zero since the objective function is bounded below.
The above observations about the learning rate computed by our adaptive approach on logistic regression motivated us to introduce a hybrid version of our framework: we run the adaptive framework for a relatively small number of iterations at the beginning of some milestone epochs, and then use the median of the adaptive learning rates as a constant learning rate until the next milestone (similar to the learning rate schedulers that are widely used in practice).
Another interesting observation is that Adaptive-SGD sometimes takes large steps but manages to stay stable, compared to vanilla SGD with a large step size. This is mainly due both to the adaptive step computation based on local curvature and to the batch-increasing strategy that reduces the stochasticity of the gradient.
7.3 Deep neural network results
Although the loss function corresponding to neural network models is generally non-convex, we were motivated to apply our framework to training deep learning models based on the recent experimental observations in (hessiannn)
. This work numerically studied the spectral density of the Hessian of deep neural network loss functions throughout the optimization process and observed that the negative eigenvalues tend to disappear rather rapidly in the optimization process. In our case, we are only interested in the curvature along the gradient direction. Surprisingly, we found that negative curvature was rarely encountered at the current iterate along the step direction. Hence, when a negative value was encountered, we simply took the median of the step sizes over the most recent K steps as a heuristic. In our experiments we set K = 20.
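The negative-curvature safeguard described above can be sketched as follows (the cold-start default and the stand-in adaptive rule are our own choices):

```python
import numpy as np
from collections import deque

K = 20
recent_steps = deque(maxlen=K)   # the K most recently accepted step sizes

def safeguarded_step(g, curvature_along_g):
    """If negative (or zero) curvature is observed along the step direction,
    fall back to the median of the last K accepted step sizes, mirroring the
    heuristic of this section."""
    if curvature_along_g > 0:
        t = (g @ g) / curvature_along_g   # stand-in for the adaptive rule
        recent_steps.append(t)
        return t
    if recent_steps:
        return float(np.median(recent_steps))
    return 1e-3                           # hypothetical cold-start default
```

Because negative curvature along the step direction was rare in our experiments, the fallback branch is triggered only occasionally and the median keeps the step size at a recently successful scale.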
The Adaptive-SGD method successfully determined near-optimal learning rates (see Figure 5) and outperformed vanilla SGD with different learning rates in the range of the chosen adaptive learning rates. The CNN we used was a shallow two-layer convolutional neural network with a ReLU activation function and batch normalization, as well as a fully connected layer at the end.
We also experimented with a large-scale model, using ResNet18 on the CIFAR10 dataset, where we applied the adaptive framework to SGD with momentum and to ADAM to automatically tune the learning rate at the beginning of epochs 1, 25, 50, …, 150 (using the first 20 mini-batch iterations to roughly estimate a good constant learning rate between milestones). This method can be seen as a substitute for the common approach of using a learning rate schedule to train large neural nets such as ResNet, DenseNet, etc., where typically the user supplies an initial learning rate, milestone epochs, and a decay factor. We tested our framework against the hyper-parameters used in (resnetpaper), and our adaptive SGD with momentum achieved roughly the same performance without any hyper-parameter tuning. We believe this tool will be useful for selecting good learning rates. Results are presented in Figure 6. Additional numerical results obtained on the CIFAR10 dataset using the VGG11 (VGGNet) DNN are presented in Appendix B.2.
We presented an adaptive framework for stochastic optimization that we believe is a valuable tool for fast and practical learning rate tuning, especially in deep learning applications, and which can also be combined with popular variants of SGD such as ADAM, ADAGRAD, etc. Studying theoretical convergence guarantees of our method for DNNs, whose loss functions are non-convex, and the convergence of our adaptive framework combined with other variants of SGD, suggests interesting avenues for future research.
We thank Wenbo Gao and Chaoxu Zhou for helpful discussions about the convergence results. Partial support for this research has been provided by NSF Grant CCF 1838061.
Appendix Appendix A Proofs
Theorem 1 Let . Suppose that satisfies the Assumptions 1-4 and that we have an efficient way to compute where . Let be the set of iterates, and sample Hessian and gradient batches generated by taking the step at iteration starting from any , where are chosen such that Lemma 3 and Lemma 2 are satisfied at each iteration. Then, for any , with probability we have :
By Assumption 3, choosing ensures that the conditions of Lemma 3 are satisfied. Therefore, we have, with probability :
Hence, with probability :
Let . Using Lemma 1, we have:
If we sample the Hessian at the rate described in Lemma 2, the following inequality is satisfied with probability :
Taking results in the following inequalities being satisfied with probability :
We have, for :
Using the monotonicity of , inequality (14), and the choice of the step size , we have:
The choice of is possible because we now prove that . Let . Clearly, is decreasing as a function of . Using inequality (14), we have
Now we obtain a decrease guarantee of the function with probability
Now, using the Cauchy-Schwarz inequality, inequalities (14) and Assumption 1-3 we have:
By observing that satisfies for all , with probability , we have:
with . Using Assumption 2, we have . Therefore, we have:
Using (14), we have with probability :
It is well known (bertsekas2003convex) that for strongly convex functions:
Substituting this in (18) and subtracting from both sides, we obtain with probability :
which is the desired result. ∎
Lemma 4. Suppose that satisfies Assumptions 1-4. Let be the iterates generated by the idealized Algorithm 2, where is chosen such that the (exact) acute-angle test (4.5) and the Hessian sub-sampling test (10) are satisfied at each iteration for any given constants . Then, for all , starting from any ,
with probability at least . Moreover, with probability , we have