
# Bolstering Stochastic Gradient Descent with Model Building

The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when the algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the stepsize. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the stepsize but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide a convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in most problems. Moreover, our experiments show that the proposed method is quite robust as it converges for a wide range of initial stepsizes.


## 1 Stochastic Model Building

We introduce a new stochastic unconstrained optimization algorithm in order to approximately solve problems of the form

$$\min_{x \in \mathbb{R}^n} \; f(x) = \mathbb{E}[F(x,\xi)], \qquad (1)$$

where $f$ is continuously differentiable and possibly nonconvex, $\xi$ denotes a random variable, and $\mathbb{E}[\cdot]$ denotes the expectation taken with respect to $\xi$. We assume the existence of a stochastic first-order oracle, which outputs a stochastic gradient $g(x,\xi)$ of $f$ for a given $x$. A common approach to tackle (1) is to solve the empirical risk problem

$$\min_{x \in \mathbb{R}^n} \; f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x), \qquad (2)$$

where $f_i$ is the loss function corresponding to the $i$th data sample, and $N$ denotes the number of data samples, which can be very large in modern applications.
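As a small illustration of (2), the empirical risk and its mini-batch estimates can be set up as follows; the least-squares loss and the synthetic data here are ours and only for demonstration:

```python
import numpy as np

# A toy instance of the empirical risk (2) with a least-squares loss
# f_i(x) = 0.5 * (a_i^T x - b_i)^2; the data (A, b) are synthetic.
rng = np.random.default_rng(0)
N, d = 100, 3
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def f(x):
    """Full empirical risk: (1/N) * sum_i f_i(x)."""
    return np.mean(0.5 * (A @ x - b) ** 2)

def f_batch(x, idx):
    """Mini-batch estimate of f(x) from the samples indexed by idx."""
    return np.mean(0.5 * (A[idx] @ x - b[idx]) ** 2)
```

A mini-batch value `f_batch(x, idx)` with a uniformly random `idx` is an unbiased estimate of `f(x)`, which is the role played by the stochastic oracle above.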

As an alternative approach to line search for SGD, we propose a stochastic model building strategy inspired by the work of Oztoprak:2017. Unlike core SGD methods, our approach incorporates curvature information that adjusts not only the stepsize but also the search direction. Oztoprak:2017 consider only the deterministic setting, and they apply the model building strategy repeatedly until sufficient descent is achieved. In our stochastic setting, however, we have observed experimentally that taking multiple model steps does not improve the performance much, while its cost in run time can be extremely high in deep learning problems. Therefore, if sufficient descent is not achieved by the stochastic gradient step, then we construct only one model to adjust the size and the direction of the step.

Conventional stochastic quasi-Newton methods adjust the gradient direction by a scaling matrix that is constructed with information from the previous steps. Our model building approach, however, uses the most recent curvature information around the latest iterate. In popular deep learning implementations, model parameters come in groups, and updates are applied to each parameter group separately. Therefore, we also propose to build a model for each parameter group separately, making the step lengths adaptive.

The proposed iterative algorithm SMB works as follows: At step $k$, given the iterate $x_k$, we calculate the stochastic function value $f_k$ and the mini-batch stochastic gradient $g_k$ at $x_k$ over a mini batch of size $m_k$, where $\xi_k$ is the realization of the random vector $\xi$. Then, we apply the SGD update to calculate the trial step $x_k^t = x_k - \alpha_k g_k$, where $(\alpha_k)$ is a sequence of learning rates. With this trial step, we also calculate the function and gradient values $f_k^t$ and $g_k^t$ at $x_k^t$. Then, we check the stochastic Armijo condition

$$f_k^t \le f_k - c\,\alpha_k \|g_k\|^2, \qquad (3)$$

where $c > 0$ is a hyper-parameter. If the condition is satisfied and we achieve sufficient decrease, then we set $x_{k+1} = x_k^t$ as the next iterate. If the Armijo condition is not satisfied, then we build a quadratic model using the linear models at the points $x_k$ and $x_k^t$ for each parameter group $p$, and find the step $s_{k,p}$ to reach its minimum point. Here, $g_{k,p}$ and $g^t_{k,p}$ denote respectively the coordinates of $g_k$ and $g^t_k$ that correspond to the parameter group $p$. We calculate the next iterate $x_{k+1} = x_k + s_k$, where $s_k = (s_{k,p_1}, \dots, s_{k,p_n})$ and $n$ is the number of parameter groups, and proceed to the next step with $x_{k+1}$. This model step, if needed, requires extra mini-batch function and gradient evaluations (forward and backward passes in deep neural networks).
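The iteration above can be sketched in a few lines. The following is a minimal single-parameter-group NumPy sketch; the oracle callables `loss` and `grad` and the constants `c` and `eta` are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def model_step(g, gt, st, eta):
    """Forward model step per eqs. (4)-(5) for one parameter group."""
    y = gt - g                                    # curvature pair y = g^t - g
    ns, ny, ng = np.linalg.norm(st), np.linalg.norm(y), np.linalg.norm(g)
    delta = 0.5 * (ns * (ny + ng / eta) - y @ st)
    theta = (y @ st + 2 * delta) ** 2 - ns ** 2 * ny ** 2
    cg = -ns ** 2 / (2 * delta)
    cy = -ns ** 2 / (2 * delta * theta) * (
        -(y @ st + 2 * delta) * (st @ g) + ns ** 2 * (y @ g))
    cs = -ns ** 2 / (2 * delta * theta) * (
        -(y @ st + 2 * delta) * (y @ g) + ny ** 2 * (st @ g))
    return cg * g + cy * y + cs * st

def smb_iteration(x, loss, grad, alpha, c=0.1, eta=0.8):
    """One SMB iteration (single parameter group, illustrative constants)."""
    fk, gk = loss(x), grad(x)
    st = -alpha * gk                              # trial SGD step
    xt = x + st
    ft, gt = loss(xt), grad(xt)                   # extra forward/backward pass
    if ft <= fk - c * alpha * (gk @ gk):          # stochastic Armijo test (3)
        return xt                                 # sufficient decrease: keep trial point
    return x + model_step(gk, gt, st, eta)        # otherwise take one model step
```

On a simple quadratic, a small stepsize passes the Armijo test and reproduces the plain SGD step, while an overly large stepsize triggers the corrective model step.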

For each parameter group $p$, the quadratic model is built by combining the linear models at $x_k$ and $x_k^t$, given by

$$l^0_{k,p}(s) := f_k + g_{k,p}^\top s \quad\text{and}\quad l^t_{k,p}(s - s^t_{k,p}) := f^t_k + (g^t_{k,p})^\top (s - s^t_{k,p}),$$

respectively. Then, the quadratic model becomes

$$m^t_{k,p}(s) := \alpha^0_{k,p}(s)\, l^0_{k,p}(s) + \alpha^t_{k,p}(s)\, l^t_{k,p}(s - s^t_{k,p}),$$

where

$$\alpha^0_{k,p}(s) = \frac{(s - s^t_{k,p})^\top(-s^t_{k,p})}{(-s^t_{k,p})^\top(-s^t_{k,p})} \quad\text{and}\quad \alpha^t_{k,p}(s) = \frac{s^\top s^t_{k,p}}{(s^t_{k,p})^\top s^t_{k,p}}.$$

The constraint

$$\|s\|^2 + \|s - s^t_{k,p}\|^2 \le \|s^t_{k,p}\|^2$$

is also imposed so that the minimum is attained in the region bounded by $0$ and $s^t_{k,p}$. This constraint acts like a trust region. Figure 1 shows the steps of this construction.
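By construction, the weights $\alpha^0_{k,p}$ and $\alpha^t_{k,p}$ make the quadratic model interpolate the two loss values, $m^t_{k,p}(0) = f_k$ and $m^t_{k,p}(s^t_{k,p}) = f^t_k$. A quick numerical check with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
fk, ft = 1.3, 0.9                      # made-up loss values at x_k and x_k^t
g, gt = rng.normal(size=n), rng.normal(size=n)  # gradients at the two points
st = rng.normal(size=n)                # trial step s^t_{k,p}

l0 = lambda s: fk + g @ s              # linear model at x_k
lt = lambda s: ft + gt @ (s - st)      # linear model at x_k^t
a0 = lambda s: ((s - st) @ (-st)) / (st @ st)   # weight of l0
at = lambda s: (s @ st) / (st @ st)             # weight of lt
m = lambda s: a0(s) * l0(s) + at(s) * lt(s)     # quadratic model m^t_{k,p}

assert np.isclose(m(np.zeros(n)), fk)  # m recovers f_k at s = 0
assert np.isclose(m(st), ft)           # m recovers f_k^t at s = s^t
```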

In this work, we solve a relaxation of this constrained model as explained in (Oztoprak:2017, Section 2.2). The minimum value of the relaxed model is attained at the point $s_{k,p}$ with

$$s_{k,p} = c_{g,p}(\delta)\, g_{k,p} + c_{y,p}(\delta)\, y_{k,p} + c_{s,p}(\delta)\, s^t_{k,p}, \qquad (4)$$

where $y_{k,p} = g^t_{k,p} - g_{k,p}$. Here, the coefficients are given as

$$c_{g,p}(\delta) = -\frac{\|s^t_{k,p}\|^2}{2\delta}, \qquad c_{y,p}(\delta) = -\frac{\|s^t_{k,p}\|^2}{2\delta\theta}\Big[-\big(y_{k,p}^\top s^t_{k,p} + 2\delta\big)\,(s^t_{k,p})^\top g_{k,p} + \|s^t_{k,p}\|^2\, y_{k,p}^\top g_{k,p}\Big],$$

$$c_{s,p}(\delta) = -\frac{\|s^t_{k,p}\|^2}{2\delta\theta}\Big[-\big(y_{k,p}^\top s^t_{k,p} + 2\delta\big)\, y_{k,p}^\top g_{k,p} + \|y_{k,p}\|^2\,(s^t_{k,p})^\top g_{k,p}\Big],$$

with

$$\theta = \big(y_{k,p}^\top s^t_{k,p} + 2\delta\big)^2 - \|s^t_{k,p}\|^2 \|y_{k,p}\|^2 \quad\text{and}\quad \delta = \tfrac{1}{2}\Big(\|s^t_{k,p}\|\big(\|y_{k,p}\| + \tfrac{1}{\eta}\|g_{k,p}\|\big) - y_{k,p}^\top s^t_{k,p}\Big), \qquad (5)$$

where $\eta \in (0,1)$ is a constant which controls the size of $s_{k,p}$ (in the appendix we show that it bounds the eigenvalues of the implied scaling matrix). Then, the adaptive model step becomes $s_k = (s_{k,p_1}, \dots, s_{k,p_n})$. We note that our construction in terms of different parameter groups lends itself to constructing a different model for each parameter subspace.

We summarize the steps of SMB in Algorithm 1. First, the trial point is obtained with the standard stochastic gradient step. If this step satisfies the stochastic Armijo condition, then we proceed to the next iteration. Otherwise, we continue with building the models for each parameter group, and move to the next iterate with the model building step.

## 2 Convergence Analysis

The steps of SMB can be considered as a special quasi-Newton update:

$$x_{k+1} = x_k - \alpha_k H_k g_k, \qquad (6)$$

where $H_k$ is a symmetric positive definite matrix that serves as an approximation to the inverse Hessian matrix. In the appendix (Proof of Theorem 2), we explain this connection and give an explicit formula for the matrix $H_k$. We also prove that there exist constants $\bar{\kappa} \ge \underline{\kappa} > 0$ such that for all $k$ the matrix $H_k$ satisfies

$$\underline{\kappa}\, I \preceq H_k \preceq \bar{\kappa}\, I, \qquad (7)$$

where for two symmetric matrices $A$ and $B$, $A \preceq B$ means that $B - A$ is positive semidefinite. It is important to note that $H_k$ is built with the information collected around $x_k$, in particular $g_k$. Therefore, unlike stochastic quasi-Newton methods, $H_k$ is correlated with $g_k$, and hence, the resulting step is very difficult to analyze. Unfortunately, this difficulty prevents us from using the general framework given by Wang:2017.

To overcome this difficulty and carry on with the convergence analysis, we modify Algorithm 1 such that $H_k$ is calculated with a new independent mini-batch, and therefore, it is independent of $g_k$. By doing so, we still build a model using the information around $x_k$. Assuming that $g_k$ is an unbiased estimator of $\nabla f(x_k)$, we conclude that $\mathbb{E}[H_k g_k] = \mathbb{E}[H_k]\,\nabla f(x_k)$. In the rest of this section, we provide a convergence analysis for this modified algorithm, which we call SMBi (i for independent batch). The steps of SMBi are given in Algorithm 2. As Step 11 shows, we obtain the model building step with a new random batch.

Assumptions: Before providing the analysis, let us assume that $f$ is continuously differentiable, lower bounded by $f^{\mathrm{low}}$, and there exists $L > 0$ such that for any $x, y \in \mathbb{R}^n$, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$. We also assume that $\xi_k$, $k = 1, 2, \dots$, are independent samples, and for any iteration $k$, $\xi_k$ is independent of $x_k$, $\mathbb{E}[g(x_k,\xi_k)] = \nabla f(x_k)$, and $\mathbb{E}\big[\|g(x_k,\xi_k) - \nabla f(x_k)\|^2\big] \le M^2$ for some $M > 0$.

In order to be in line with practical implementations and with our experiments, we first provide an analysis covering the constant stepsize case for (possibly) non-convex objective functions.

Below, we denote by $\xi_{[T]} := (\xi_1, \dots, \xi_T)$ the random samplings in the first $T$ iterations. Let $\alpha_{\max}$ be the maximum stepsize that is allowed in the implementation of SMBi with

$$\alpha_{\max} \ge \frac{-1 + \sqrt{1 + 16\eta^2}}{4L\eta}. \qquad (8)$$

This hyper-parameter of maximum stepsize is needed in the theoretical results. The same parameter can also be used to apply automatic stepsize adjustment (see our numerical experiments with stepsize auto-scheduling in Section 3.2). Observe that since $\sqrt{1 + 16\eta^2} \le 1 + 4\eta$, choosing $\alpha_{\max} \ge 1/L$ suffices to satisfy (8). The proof of the following convergence result is given in the appendix (Proof of Theorem 2).

**Theorem 2.** Suppose that our assumptions above hold and $\{x_k\}$ is generated by SMBi as given in Algorithm 2. Suppose also that the stepsize sequence $(\alpha_k)$ in Algorithm 2 satisfies $\alpha_k \le \alpha_{\max}$ for all $k$. For given $T$, let $R$ be a random variable with the probability mass function

$$P_R(k) := \mathbb{P}\{R = k\} = \frac{\alpha_k/(\eta^{-1} + 2L\alpha_{\max}) - \alpha_k^2 L/2}{\sum_{k=1}^{T}\big(\alpha_k/(\eta^{-1} + 2L\alpha_{\max}) - \alpha_k^2 L/2\big)},$$

for $k = 1, \dots, T$. Then, we have

$$\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{D_f + (M^2 L/2)\sum_{k=1}^{T}\big(\alpha_k^2/m_k\big)}{\sum_{k=1}^{T}\big(\alpha_k/(\eta^{-1} + 2L\alpha_{\max}) - \alpha_k^2 L/2\big)},$$

where $D_f := f(x_1) - f^{\mathrm{low}}$ and the expectation is taken with respect to $R$ and $\xi_{[T]}$. Moreover, if we choose $\alpha_k = \frac{1}{L(\eta^{-1} + 2L\alpha_{\max})}$ and $m_k = m$ for all $k$, then this bound reduces to

$$\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{2L(\eta^{-1} + 2L\alpha_{\max})^2 D_f}{T} + \frac{M^2}{m}.$$

Using this theorem, it is possible to deduce that the stochastic first-order oracle complexity of SMB with random output and constant stepsize is $\mathcal{O}(\varepsilon^{-2})$ (Wang:2017, Corollary 2.12).

In (Wang:2017, Theorem 2.5), it is shown that under our assumptions above (along with an additional technical assumption), if the sequence $\{x_k\}$ is generated by the SMBi method (when $H_k$ is calculated with an independent batch in each step) with batch size $m_k = m$ for all $k$, then there exists a positive constant $M_f$ such that $\mathbb{E}[f(x_k)] \le M_f$ for all $k$. Using this observation, the proof of Theorem 2, and Theorem 2.8 in (Wang:2017), we can also give the following complexity result when the stepsize sequence is diminishing for non-convex objective functions.

Let the batch size be $m_k = m$ and assume that $\alpha_k = \alpha_{\max}\, k^{-\phi}$ for all $k$. Then $\{x_k\}$ generated by SMBi satisfies

$$\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le 2L(\eta^{-1} + 2L\alpha_{\max})\big(M_f - f^{\mathrm{low}}\big)\, T^{\phi - 1} + \frac{M^2}{(1-\phi)\,m}\big(T^{-\phi} - T^{-1}\big)$$

for some $\phi \in (0.5, 1)$, where $T$ denotes the iteration number. Moreover, for a given $\varepsilon > 0$, to guarantee that $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] < \varepsilon$, the number of iterations needed is at most $\mathcal{O}\big(\varepsilon^{-1/(1-\phi)}\big)$.

We are now ready to assess the performance of SMB and SMBi with some numerical experiments.

## 3 Numerical Experiments

In this section, we compare SMB and SMBi against SGD, Adam Kingma:2015, and SLS (SGD+Armijo) Vas:2019. We have chosen SLS since it is a recent method that uses stochastic line search with backtracking. We have conducted experiments on multi-class classification problems using neural network models. The implementations of the models are taken from https://github.com/IssamLaradji/sls. Our Python package SMB, along with the scripts to conduct our experiments, is available online: https://github.com/sibirbil/SMB

### 3.1 Constant Stepsize

We start our experiments with constant stepsizes for all methods. We should point out that the SLS method adjusts the stepsize after each backtracking process and also uses a stepsize reset algorithm between epochs. We refer to this routine as stepsize auto-scheduling. Therefore, we find it unfair to compare SLS against other methods with constant stepsizes. Please see Section 3.2 for a discussion about stepsize auto-scheduling using SMB.

#### MNIST dataset.

On the MNIST dataset, we have used a one-hidden-layer multi-layer perceptron (MLP) of width 1,000. We compare all methods after cross-validating their best performance over a set of learning rates. For SMB and SLS, we have used the default value of the hyper-parameter of SLS that appears in the Armijo condition (also recommended by the authors of SLS).

In Figure 2, we see the best performances of all five methods on the MNIST dataset with respect to epochs and run time. The reported experiments consist of five independent runs whose results are averaged. Even though SMB and SMBi may calculate an extra function value (forward pass) and an extra gradient (backward pass), we see that on this problem SMB and SMBi achieve the best performance with respect to the run time as well as the number of epochs. More importantly, the generalization performance of SMB and SMBi is also better than that of the remaining three methods.

It should be pointed out that, in practice, choosing a new independent batch means that the SMBi method constructs a model step over two iterations using two batches. This way, the computational cost of each iteration is reduced, but model steps can only be taken in half of the iterations of an epoch. As seen in Figure 2, this does not seem to affect the performance significantly.

#### CIFAR10 and CIFAR100 datasets.

For the CIFAR10 and CIFAR100 datasets, we have used the standard image-classification architectures ResNet-34 (He:2016) and DenseNet-121 (Huang:2017). Due to the high computational costs of these architectures, we report the results of a single run of each method. For Adam, we have used the default learning rate 0.001, and for SGD, we have set the tuned learning rate to 0.1 as reported in Vas:2019. For SMB and SLS, we have again used the default learning rate of 1.0 and the Armijo constant of SLS.

In Figure 3, we see that on CIFAR10-ResNet34 and CIFAR100-ResNet34, SMB performs better than the Adam and SGD algorithms. However, its performance is only comparable to that of SLS. Even though SMB reaches a lower loss value on CIFAR100-ResNet34, this advantage does not show in the test accuracy. As mentioned at the beginning of this section, the SLS method adjusts the stepsize after each backtracking process and, in order to prevent diminishing stepsizes, uses a stepsize reset algorithm between epochs. SMB does not benefit from this kind of stepsize auto-scheduling. We define an auto-scheduling routine for SMB stepsizes in Section 3.2 so that we obtain a fairer comparison between SMB and SLS.

In Figure 4, we see a comparison of the performances of SMB and SLS on CIFAR100-DenseNet121. SMB with a constant stepsize outperforms SLS on training loss and reaches a high test accuracy before SLS does. Vas:2019 show that SLS with these settings outperforms Adam and SGD on this problem, both in terms of training loss and test accuracy.

### 3.2 Stepsize Auto-Scheduling

As expected, SMB can take many model steps when the learning rate is too large. The extra mini-batch function and gradient evaluations can then slow down the algorithm (cf. Figure 3). We believe that the number of model steps taken in an epoch (when the Armijo condition is not satisfied) can be a good measure for adjusting the learning rate in the next epoch. This leads to an automatic learning rate scheduling algorithm. We conducted preliminary experiments with a simple stepsize auto-scheduling routine; the results are given in Figure 5. At the end of each epoch, we multiply the stepsize by 0.9 when the model steps taken in that epoch make up more than 5% of the total steps taken. Otherwise, we divide the stepsize by 0.9, unless the division yields a stepsize greater than the maximum stepsize allowed, $\alpha_{\max}$. The value 0.9 is the backtracking ratio of SLS, and we treat the 5% threshold as a hyper-parameter. Figure 5 shows that, in terms of training loss, both SMB and SMBi perform better than the other methods. In terms of test accuracy, SMB performs better than all other methods, and SMBi performs comparably to SLS.
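A minimal sketch of this end-of-epoch rule follows; the function name is hypothetical, and capping at the maximum stepsize (rather than skipping the update) is one reading of the rule:

```python
def schedule_stepsize(alpha, model_steps, total_steps, alpha_max,
                      ratio=0.9, threshold=0.05):
    """End-of-epoch stepsize update based on the fraction of model steps."""
    if model_steps > threshold * total_steps:
        return alpha * ratio              # too many model steps: shrink stepsize
    return min(alpha / ratio, alpha_max)  # otherwise grow, capped at alpha_max
```

For example, with 10 model steps out of 100 the stepsize shrinks by the factor 0.9, whereas with 2 out of 100 it grows by the same factor up to `alpha_max`.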

### 3.3 Robustness with respect to Stepsize

Our last set of experiments is devoted to demonstrating the robustness of SMB. The preliminary results in Figure 6 show that SMB is more robust to the choice of the learning rate than Adam and SGD, especially in deep neural networks. This aspect of SMB deserves further theoretical and experimental attention.

## 4 Conclusion

SMB is a fast alternative to the stochastic gradient method. The algorithm provides a model building approach that replaces the one-step backtracking used in stochastic line search methods. We have analyzed the convergence properties of a modification of SMB by rewriting its model building step as a quasi-Newton update and constructing the scaling matrix with a new independent batch. Our numerical results show that SMB converges fast and that its performance is much less sensitive to the selected stepsize than the Adam and SGD algorithms. In its current state, SMB lacks an internal learning rate adjusting mechanism that could reset the learning rate depending on the progression of the iterations. As shown in Section 3.2, SMB can greatly benefit from a stepsize auto-scheduling routine; this is future work that we will consider. Our convergence rate analysis is given for the alternative algorithm SMBi, which performs well against other methods but consistently underperforms the original SMB method. This calls for a convergence analysis of the SMB method itself.

## Proof of Theorem 2

First, we show that the SMB step for each parameter group can be expressed as a special quasi-Newton update. For brevity, we drop the parameter-group subscript and use $s^t_k$, $g_k$, $y_k$, $\delta$, and $\theta$ instead of $s^t_{k,p}$, $g_{k,p}$, $y_{k,p}$, $\delta_p$, and $\theta_p$, respectively. Recalling the definitions of $\delta$ and $\theta$ given in (5), and using $s^t_k = -\alpha_k g_k$, observe that

$$2\delta = \|s^t_k\|\|y_k\| + \tfrac{1}{\eta}\|s^t_k\|\|g_k\| - y_k^\top s^t_k = \alpha_k\Big(\|g_k\|\|y_k\| + \tfrac{1}{\eta}\|g_k\|^2 + y_k^\top g_k\Big) = \alpha_k\sigma,$$

and

$$\theta = \big(y_k^\top s^t_k + 2\delta\big)^2 - \|s^t_k\|^2\|y_k\|^2 = \alpha_k^2\big(\sigma - y_k^\top g_k\big)^2 - \alpha_k^2\|g_k\|^2\|y_k\|^2 = \alpha_k^2\big(\beta^2 - \|g_k\|^2\|y_k\|^2\big) = \alpha_k^2\gamma,$$

where

$$\sigma = \|g_k\|\|y_k\| + \tfrac{1}{\eta}\|g_k\|^2 + y_k^\top g_k, \qquad \beta = \sigma - y_k^\top g_k, \qquad \gamma = \beta^2 - \|g_k\|^2\|y_k\|^2.$$

Therefore, we have

$$c_g(\delta)\,g_k = -\frac{\|s^t_k\|^2}{2\delta}\,g_k = -\frac{\alpha_k^2\|g_k\|^2}{\alpha_k\sigma\gamma}\,\gamma\, g_k = -\frac{\alpha_k\|g_k\|^2}{\sigma\gamma}\,\gamma\, g_k,$$

$$\begin{aligned} c_y(\delta)\,y_k &= -\frac{\|s^t_k\|^2}{2\delta\theta}\Big[-\big(y_k^\top s^t_k + 2\delta\big)(s^t_k)^\top g_k + \|s^t_k\|^2\, y_k^\top g_k\Big]\,y_k \\ &= -\frac{\|g_k\|^2}{\alpha_k\sigma\gamma}\,y_k\Big[\alpha_k^2\big(\sigma - y_k^\top g_k\big)g_k^\top g_k + \alpha_k^2\|g_k\|^2\, y_k^\top g_k\Big] \\ &= -\frac{\alpha_k\|g_k\|^2}{\sigma\gamma}\Big[\beta\, y_k g_k^\top + \|g_k\|^2\, y_k y_k^\top\Big]g_k, \end{aligned}$$

and

$$\begin{aligned} c_s(\delta)\,s^t_k &= -\frac{\|s^t_k\|^2}{2\delta\theta}\Big[-\big(y_k^\top s^t_k + 2\delta\big)y_k^\top g_k + \|y_k\|^2\,(s^t_k)^\top g_k\Big]\,s^t_k \\ &= -\frac{\|g_k\|^2}{\alpha_k\sigma\gamma}\,(-\alpha_k)\,g_k\Big[-\alpha_k\big(\sigma - y_k^\top g_k\big)y_k^\top g_k - \alpha_k\|y_k\|^2\, g_k^\top g_k\Big] \\ &= -\frac{\alpha_k\|g_k\|^2}{\sigma\gamma}\Big[\beta\, g_k y_k^\top + \|y_k\|^2\, g_k g_k^\top\Big]g_k. \end{aligned}$$

Now, it is easy to see that

$$s_k = c_g(\delta)\,g_k + c_y(\delta)\,y_k + c_s(\delta)\,s^t_k = -\frac{\alpha_k\|g_k\|^2}{\sigma\gamma}\Big[\gamma I + \beta\, y_k g_k^\top + \|g_k\|^2\, y_k y_k^\top + \beta\, g_k y_k^\top + \|y_k\|^2\, g_k g_k^\top\Big]g_k.$$

Thus, for each parameter group $p$, we define

$$H_{k,p} = \frac{\|g_{k,p}\|^2}{\sigma_p\gamma_p}\Big[\gamma_p I + \beta_p\, y_{k,p} g_{k,p}^\top + \|g_{k,p}\|^2\, y_{k,p} y_{k,p}^\top + \beta_p\, g_{k,p} y_{k,p}^\top + \|y_{k,p}\|^2\, g_{k,p} g_{k,p}^\top\Big], \qquad (9)$$

where

$$\sigma_p = \|g_{k,p}\|\|y_{k,p}\| + \tfrac{1}{\eta}\|g_{k,p}\|^2 + y_{k,p}^\top g_{k,p}, \qquad \beta_p = \sigma_p - y_{k,p}^\top g_{k,p}, \qquad \gamma_p = \beta_p^2 - \|g_{k,p}\|^2\|y_{k,p}\|^2.$$

Now, assuming that we have the parameter groups $p_1, \dots, p_n$, the SMB steps can be expressed as a quasi-Newton update given by

$$x_{k+1} = x_k - \alpha_k H_k g_k,$$

where

$$H_k = \begin{cases} I, & \text{if the Armijo condition is satisfied;} \\ \operatorname{diag}\big(H_{k,p_1}, \dots, H_{k,p_n}\big), & \text{otherwise.} \end{cases}$$

Here, $I$ denotes the identity matrix, and $\operatorname{diag}\big(H_{k,p_1}, \dots, H_{k,p_n}\big)$ denotes the block diagonal matrix with the blocks $H_{k,p_1}, \dots, H_{k,p_n}$.

We next show that the eigenvalues of the matrices $H_{k,p}$, $p = p_1, \dots, p_n$, are bounded from above and below uniformly, which is, of course, obvious when $H_k = I$. Using the Sherman-Morrison formula twice, one can see that for each parameter group $p$, the matrix $H_{k,p}$ is indeed the inverse of the positive semidefinite matrix

$$B_{k,p} = \frac{1}{\|g_{k,p}\|^2}\Big(\sigma_p I - g_{k,p}\, y_{k,p}^\top - y_{k,p}\, g_{k,p}^\top\Big),$$

and hence, it is also positive semidefinite. Therefore, it is enough to show that the eigenvalues of $B_{k,p}$ are bounded uniformly in $k$ and $p$.
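This inverse relation is easy to verify numerically. The sketch below assembles $H_{k,p}$ from (9) and $B_{k,p}$ for random vectors (the dimension and the value of $\eta$ are arbitrary choices of ours) and checks the product against the identity, together with the exact smallest eigenvalue $1/\eta$:

```python
import numpy as np

# Random data for one parameter group; eta < 1 is illustrative.
rng = np.random.default_rng(1)
n, eta = 6, 0.8
g, y = rng.normal(size=n), rng.normal(size=n)

G, Y, a = g @ g, y @ y, y @ g               # ||g||^2, ||y||^2, y^T g
sigma = np.sqrt(G * Y) + G / eta + a        # sigma_p
beta = sigma - a                            # beta_p
gamma = beta ** 2 - G * Y                   # gamma_p

H = (G / (sigma * gamma)) * (
    gamma * np.eye(n)
    + beta * np.outer(y, g) + G * np.outer(y, y)
    + beta * np.outer(g, y) + Y * np.outer(g, g))   # eq. (9)
B = (sigma * np.eye(n) - np.outer(g, y) - np.outer(y, g)) / G

assert np.allclose(H @ B, np.eye(n))        # H_{k,p} = B_{k,p}^{-1}
assert np.isclose(np.linalg.eigvalsh(B).min(), 1 / eta)
```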

Since $g_{k,p}\, y_{k,p}^\top + y_{k,p}\, g_{k,p}^\top$ is a rank-two matrix, $\sigma_p / \|g_{k,p}\|^2$ is an eigenvalue of $B_{k,p}$ whose multiplicity is two less than the size of the parameter group. The remaining extreme eigenvalues are

$$\lambda_{\max}(B_{k,p}) = \frac{\sigma_p + \|g_{k,p}\|\|y_{k,p}\| - y_{k,p}^\top g_{k,p}}{\|g_{k,p}\|^2} \quad\text{and}\quad \lambda_{\min}(B_{k,p}) = \frac{\sigma_p - \|g_{k,p}\|\|y_{k,p}\| - y_{k,p}^\top g_{k,p}}{\|g_{k,p}\|^2},$$

with the corresponding eigenvectors $\|y_{k,p}\|\, g_{k,p} - \|g_{k,p}\|\, y_{k,p}$ and $\|y_{k,p}\|\, g_{k,p} + \|g_{k,p}\|\, y_{k,p}$, respectively.

Observe that,

$$\lambda_{\min}(B_{k,p}) = \frac{\sigma_p - \|g_{k,p}\|\|y_{k,p}\| - y_{k,p}^\top g_{k,p}}{\|g_{k,p}\|^2} = \frac{\|g_{k,p}\|\|y_{k,p}\| + \eta^{-1}\|g_{k,p}\|^2 + y_{k,p}^\top g_{k,p} - \|g_{k,p}\|\|y_{k,p}\| - y_{k,p}^\top g_{k,p}}{\|g_{k,p}\|^2} = \frac{\eta^{-1}\|g_{k,p}\|^2}{\|g_{k,p}\|^2} = \frac{1}{\eta} > 1.$$

Thus, the smallest eigenvalue is bounded away from zero uniformly in $k$ and $p$.

Now, by our assumption of Lipschitz continuity of the gradients, for any $x, y \in \mathbb{R}^n$ and any $k$, we have

$$\|g(x, \xi_k) - g(y, \xi_k)\| \le L\|x - y\|.$$

Thus, observing that $\|y_{k,p}\| \le L\|s^t_{k,p}\| = L\alpha_k\|g_{k,p}\|$, we have

$$\lambda_{\max}(B_{k,p}) = \frac{\sigma_p + \|g_{k,p}\|\|y_{k,p}\| - y_{k,p}^\top g_{k,p}}{\|g_{k,p}\|^2} = \frac{2\|g_{k,p}\|\|y_{k,p}\| + \eta^{-1}\|g_{k,p}\|^2}{\|g_{k,p}\|^2} \le 2L\alpha_k + \frac{1}{\eta} \le 2L\alpha_{\max} + \eta^{-1}.$$

This implies that the eigenvalues of $H_{k,p}$ are bounded below by $(2L\alpha_{\max} + \eta^{-1})^{-1}$ and bounded above by 1, uniformly in $k$ and $p$. This result, together with our assumptions, shows that the steps of the SMBi algorithm satisfy the conditions of Theorem 2.10 in (Wang:2017) with $\underline{\kappa} = (2L\alpha_{\max} + \eta^{-1})^{-1}$ and $\bar{\kappa} = 1$, and Theorem 2 follows as a corollary.