Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

1 Introduction

Deep networks have achieved excellent performance in a variety of domains such as computer vision he2016deep , language modeling zaremba2014recurrent , and speech recognition graves2013speech . The most popular optimizer is stochastic gradient descent (SGD) robbins1951stochastic , which is simple and has low per-iteration complexity. Its convergence rate is also well-established ghadimi2013stochastic ; bottou2018optimization . However, vanilla SGD is sensitive to the choice of stepsize, and requires careful tuning. To improve the efficiency and robustness of SGD, many variants have been proposed, such as momentum acceleration polyak1964some ; nesterov1983method ; sutskever2013importance and adaptive stepsizes duchi2011adaptive ; hinton2012rmsprop ; zeiler2012adadelta ; kingma2014adam .

Though variants with coordinate-wise adaptive stepsize (such as Adam kingma2014adam ) have been shown to be effective in accelerating convergence, their generalization performance is often worse than that of SGD wilson2017marginal . To improve generalization performance, attempts have been made to use a layer-wise stepsize singh2015layer ; yang2017lars ; adam2017normalized ; zhou2018adashift , which assigns different stepsizes to different layers or normalizes the layer-wise gradient. However, there has been no theoretical analysis of their empirical success. More generally, the whole network parameter can also be partitioned into blocks instead of simply into layers.

Recently, it has been shown that coordinate-wise adaptive gradient descent is closely related to sign-based gradient descent pmlr-v80-balles18a ; pmlr-v80-bernstein18a . Theoretical arguments and empirical evidence suggest that the gradient sign would impede generalization pmlr-v80-balles18a . To narrow the generalization gap, a partial adaptive parameter for the second-order momentum is proposed in chen2018closing . By using a smaller partial adaptive parameter, the adaptive gradient algorithm behaves less like sign descent and more like SGD.

Moreover, in methods with coordinate-wise adaptive stepsize, a small parameter $\epsilon$ is typically used to avoid numerical problems in practical implementation. It is discussed in zaheer2018adaptive that this parameter controls the adaptivity of the algorithm, and that using a larger value can reduce adaptivity and empirically helps Adam match the generalization performance of SGD. This suggests that coordinate-wise adaptivity may be too strong for good generalization performance.

$$\min_{\theta} F(\theta) = \mathbb{E}_{z}[f(\theta; z)], \tag{1}$$

where $f$ is some possibly nonconvex loss function, and $z$ is a random sample. The expected risk measures the generalization performance on unseen data bottou2018optimization , and reduces to the empirical risk when a finite training set is considered. We show theoretically that the proposed blockwise adaptive gradient descent can be faster than its counterpart with coordinate-wise adaptive stepsize. Using tools on uniform stability bousquet2002stability ; hardt2016train , we also show that blockwise adaptivity has potentially lower generalization error than coordinate-wise adaptivity. Empirically, blockwise adaptive gradient descent converges faster and obtains better generalization performance than its coordinate-wise counterpart (Adam) and Nesterov’s accelerated gradient (NAG) sutskever2013importance .

Notations. For an integer , . For a vector , denotes its transpose, is a diagonal matrix with on its diagonal, is the element-wise square root of , is the coordinate-wise square of , , , , where is a positive semidefinite (psd) matrix, and means for all . For two vectors and , , and denote the element-wise division and dot product, respectively. For a square matrix , is its inverse, and means that is psd. Moreover, .

2 Related Work

Recall that the SGD iterate is the solution to the problem: , where is the gradient of the loss function at iteration , and is the parameter vector. To incorporate information about the curvature of sequence , the -norm in the SGD update can be replaced by the Mahalanobis norm, leading to duchi2011adaptive :

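$$\theta_{t+1} = \arg\min_{\theta} \; \langle g_t, \theta \rangle + \frac{1}{2\eta} \|\theta - \theta_t\|^2_{\mathrm{Diag}(s_t)^{-1}}, \tag{2}$$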

where . This is an instance of mirror descent nemirovsky1983problem . Its regret bound has a gradient-related term . Adagrad’s stepsize can be obtained by examining a similar objective duchi2011adaptive :

$$\min_{s \in \mathcal{S}} \sum_{t=1}^{T} \|g_t\|^2_{\mathrm{Diag}(s)^{-1}}, \tag{3}$$

where , and is some constant. At optimality, , where . As cannot depend on ’s with , this suggests . Theoretically, this choice of leads to a regret bound that is competitive with the best post-hoc optimal bound mcmahan2010adaptive .
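As a rough numerical illustration of the optimality condition for (3), the sketch below assumes, as in the standard Adagrad derivation, that the feasible set is $\{s : s \geq 0, \langle \mathbf{1}, s \rangle \leq c\}$. Under that assumption the minimizer is proportional to the per-coordinate norms of the stacked gradients. The function names and the synthetic gradients are illustrative only.

```python
import numpy as np

def adagrad_optimal_s(grads, c=1.0):
    """Closed-form minimizer of sum_t ||g_t||^2_{Diag(s)^{-1}} over the
    assumed feasible set {s >= 0, sum(s) <= c}."""
    col_norms = np.linalg.norm(grads, axis=0)  # ||g_{1:T,i}||_2 for each coordinate i
    return c * col_norms / col_norms.sum()

def objective(grads, s):
    """sum_t g_t^T Diag(s)^{-1} g_t = sum_i ||g_{1:T,i}||_2^2 / s_i."""
    return float(np.sum(np.sum(grads ** 2, axis=0) / s))

# Synthetic gradients (T = 50 rounds, d = 4 coordinates) with very different scales.
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 4)) * np.array([0.1, 1.0, 3.0, 10.0])

s_star = adagrad_optimal_s(grads)
best = objective(grads, s_star)
```

By the Cauchy-Schwarz inequality, no other feasible `s` attains a smaller objective than `best`, which is why the stepsize is scaled by the accumulated gradient norms.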

To solve the expected risk minimization problem in (1), an Adagrad variant called weighted AdaEMA was recently proposed in zou2018sufficient . It employs weighted averaging of ’s for the stepsize, together with momentum acceleration. This is a general coordinate-wise adaptive method and includes many Adagrad variants as special cases, including Adam and RMSprop.

Let be the sample size, be the input dimensionality, and be the output dimensionality. Consider a -layer neural network, with output , where is the input matrix and are the weight matrices with and . The activation functions are assumed to be bijective (e.g., tanh and leaky ReLU). For simplicity, assume that for all . Training this neural network with the square loss corresponds to solving the nonlinear optimization problem: , where is the label matrix. Consider training the network layer-by-layer, starting from the bottom one. For layer , , where is a stochastic gradient evaluated at at time , and is the stepsize which may be adaptive in that it depends on . This layer-wise training is analogous to block coordinate descent, with each layer being a block. The optimization subproblem for the th layer can be rewritten as

$$\min_{W_l} \|\Phi_l(H_{l-1} W_l) - Y\|_2^2, \tag{4}$$

where , is the input hidden representation of at the th layer, and .

Proposition 1.

Assume that ’s (with ) are invertible. If is initialized to zero, and has full row rank, then the critical point that it converges to is also the minimum -norm solution of (4) in expectation.

As stepsize can depend on , Proposition 1 shows that blockwise adaptivity can find the minimum -norm solution of (4). In contrast, coordinate-wise adaptivity fails to find the minimum -norm solution even for the underdetermined linear least squares problem wilson2017marginal . Another benefit of using a blockwise stepsize is that the optimizer’s extra memory cost can be reduced. Using a coordinate-wise stepsize requires an additional memory for storing estimates of the second moment, while the blockwise stepsize only needs an extra memory, where is the number of blocks. A deep network generally has millions of parameters but only tens of layers. If we set to be the number of layers, memory reduction can be significant.
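The contrast drawn above for underdetermined linear least squares can be checked numerically. The following is a minimal sketch, not the paper's experiment: it uses full-batch gradients, a single block covering all coordinates, zero initialization, and hypothetical helper names (`run`, `off_row_space`). With one shared stepsize, every update is a scalar multiple of a gradient of the form $X^\top r$, so the iterate stays in the row space of $X$, where the minimum-norm solution lives. A coordinate-wise stepsize rescales each coordinate differently and generally leaves that subspace.

```python
import numpy as np

def run(X, y, blockwise, eta=0.1, eps=1e-8, steps=200):
    """Adaptive gradient descent on 0.5 * mean((X @ theta - y)**2), zero init."""
    n, d = X.shape
    theta = np.zeros(d)
    v = 0.0 if blockwise else np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ theta - y) / n
        if blockwise:
            v = v + np.mean(g ** 2)   # one accumulator for the whole vector: ||g||^2 / d
        else:
            v = v + g ** 2            # one accumulator per coordinate (Adagrad-style)
        theta = theta - eta * g / (np.sqrt(v) + eps)
    return theta

def off_row_space(X, theta):
    """Norm of the component of theta orthogonal to the row space of X."""
    P = np.linalg.pinv(X) @ X         # orthogonal projector onto the row space
    return float(np.linalg.norm(theta - P @ theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))          # 5 samples, 20 parameters: underdetermined
y = rng.normal(size=5)

res_block = off_row_space(X, run(X, y, blockwise=True))
res_coord = off_row_space(X, run(X, y, blockwise=False))
```

Here `res_block` is at the level of floating-point round-off, while `res_coord` is not. The sketch only checks the row-space property; it does not by itself establish convergence.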

There have been some recent attempts to use a layer-wise stepsize in deep networks, either by assigning a specific adaptive stepsize to each layer or by normalizing the layer-wise gradient singh2015layer ; yang2017lars ; adam2017normalized ; zhou2018adashift . However, justifications and convergence analysis are still lacking.

3.2 Blockwise Adaptive Learning Rate with Momentum

Let the gradient be partitioned to , where is the set of indices in block , and is the subvector of belonging to block . (We assume the indices in are consecutive; otherwise, we can simply reorder the elements of the gradient. Reordering does not change the result, as the objective is invariant to the ordering of the coordinates.) Inspired by problem (3) in the derivation of Adagrad, we consider the following variant, which imposes a block structure on :

$$\min_{s \in \mathcal{S}'} \sum_{t=1}^{T} \|g_t\|^2_{\mathrm{Diag}(s)^{-1}}, \tag{5}$$

where for some . It can be easily shown that at optimality, , where . The optimal is thus proportional to . When in (2) is partitioned by the same block structure, the optimal suggests incorporating into for block at time . Thus, we consider the following update rule with blockwise adaptive stepsize:

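$$v_{t,b} = v_{t-1,b} + \|g_{t,\mathcal{G}_b}\|_2^2 / d_b, \tag{6}$$

$$\theta_{t+1,\mathcal{G}_b} = \theta_{t,\mathcal{G}_b} - \eta_t \, g_{t,\mathcal{G}_b} / (\sqrt{v_{t,b}} + \epsilon), \tag{7}$$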

where $\epsilon$ is a hyperparameter that prevents numerical issues. The $a_i$’s assign different weights to the past gradients in the accumulation of variance, as:

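$$\hat{v}_{t,b} = \sum_{i=1}^{t} \frac{a_i}{A_t} \frac{\|g_{i,\mathcal{G}_b}\|_2^2}{d_b} = \frac{1}{\sum_{j=1}^{t} a_j} \sum_{i=1}^{t} a_i \frac{\|g_{i,\mathcal{G}_b}\|_2^2}{d_b}. \tag{8}$$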

In this paper, we consider the three weight sequences introduced in zou2018convergence . S.1: for some ; S.2: for some : The fraction in (8) then decreases as . S.3: for some : It can be shown that this is equivalent to using the exponential moving average estimate: , and . When , , and , the proposed algorithm reduces to Adam.
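To make the weighted average in (8) concrete, the sketch below computes it for an arbitrary weight sequence and compares it with a bias-corrected exponential moving average. The exponential weights $a_i = \beta^{-i}$ and the bias-corrected form are assumptions chosen for illustration; the exact parameterization of S.3 may differ. The inputs stand for the per-block quantities $\|g_{i,\mathcal{G}_b}\|_2^2 / d_b$.

```python
import numpy as np

def weighted_second_moment(block_sq, weights):
    """Eq. (8) for every t: v_hat_t = sum_{i<=t} a_i x_i / sum_{j<=t} a_j,
    where x_i = ||g_{i,G_b}||_2^2 / d_b."""
    block_sq = np.asarray(block_sq, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.cumsum(weights * block_sq) / np.cumsum(weights)

def bias_corrected_ema(block_sq, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, then divided by (1 - beta**t)."""
    v, out = 0.0, []
    for t, x in enumerate(block_sq, start=1):
        v = beta * v + (1.0 - beta) * x
        out.append(v / (1.0 - beta ** t))
    return np.array(out)

beta = 0.9
x = np.array([4.0, 1.0, 0.25, 9.0, 2.0])  # illustrative values of ||g||^2 / d_b
# Assumed exponential weights a_i = beta^{-i}: recent gradients weigh more.
a = beta ** (-np.arange(1, len(x) + 1, dtype=float))

v_weighted = weighted_second_moment(x, a)
v_ema = bias_corrected_ema(x, beta)

# With uniform weights, (8) is simply the running mean of the x_i.
v_uniform = weighted_second_moment(x, np.ones_like(x))
```

Multiplying the numerator and denominator of (8) by $\beta^{t}$ shows why the two estimates coincide under these assumed weights.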

3.3 Convergence Analysis

We make the following assumptions.

Assumption 1.

in (1) is lower-bounded (i.e., ) and -smooth.

Assumption 2.

Each block of stochastic gradient has bounded second moment, i.e., , where the expectation is taken w.r.t. the random .

Assumption 2 implies that the variance of each block of stochastic gradient is upper-bounded by (i.e., ).

We make the following assumption on sequence .

Assumption 3.

for some .

This implies that we can use, for example, a constant , or an increasing sequence .

Assumption 4.

(i) is non-decreasing; (ii) grows slowly such that is non-decreasing and for some ; (iii) .

This is satisfied by sequences S.1 (with and ), S.2 (with and ), and S.3 (with and ).

Assumption 5.

zou2018sufficient The stepsize is chosen such that is “almost” non-increasing, i.e., there exists a non-increasing sequence and positive constants and such that for all .

Assumption 5 is satisfied by the example sequences S.1, S.2, and S.3 when for some . Interested readers are referred to zou2018sufficient for details.

As in weighted AdaEMA, we define a sequence of virtual estimates of the second moment: . Let be its maximum over all blocks and training iterations, where the expectation is taken over all random ’s. Let for and . For a constant such that , define , where is the largest index for which . When , we set .

The following Theorem provides a bound related to the gradients.

Theorem 1.

Suppose that Assumptions 1-5 hold. Let . We have , where , , , , and . (When , the second term in (involving summation from to ) disappears.)

When , the bound here is tighter than that in zou2018sufficient , as we exploit the heterogeneous second-order upper bounds (Assumption 2). The following Corollary shows the bound with high probability.

Corollary 1.

With probability at least , we have .

Corollary 2.

Let , and for some positive constant . When for some (which holds for sequences S.1 and S.2), . When for some (sequence S.3), .

When , we obtain the same non-asymptotic convergence rates as in zou2018sufficient . Note that SGD is analogous to BAGM with , as they both use a single stepsize for all coordinates and the convergence rates depend on the same second-order moment upper bound in Assumption 2. With a decreasing stepsize, SGD also has a convergence rate of , which can be seen by setting their stepsize to in (2.4) of ghadimi2013stochastic . Thus, our rate is as good as SGD.

Next, we compare the effect of on convergence. As in depends on the sequence , a direct comparison is difficult. Instead, we study an upper bound looser than . First, we introduce the following assumption, which is stronger than Assumption 2 (that only bounds the expectation).

Assumption 6.

and .

With Assumption 6, it can be easily shown that . We can then define a looser upper bound by replacing in with . We proceed to compare the convergence using coordinate-wise stepsize (with ) and blockwise stepsize (with for some ). Note that when , Assumption 6 becomes for some , and Assumption 2 becomes for some . When , we assume that Assumption 2 is tight in the sense that , where is the set of indices in block . (Note that . On the other hand, . Thus, this bound is tight in the sense that .) The following Corollary shows that blockwise stepsize can have faster convergence than coordinate-wise stepsize.

Corollary 3.

Assume that Assumption 6 holds. Let and be the values of for and , respectively. Define , and . Let . Then, .

Note that can be larger than as . Corollary 3 then indicates that blockwise adaptive stepsize will lead to improvement if . Assume that the upper bound is tight so that . Thus, , and the above condition is likely to hold when is close to . From the definitions of , and , we can see that they get close to when are close to (i.e., has low variability). In particular, when for all (note that ). This is empirically verified in Appendix C.2.1.

3.4 Uniform Stability and Generalization Error

Given a sample of examples drawn i.i.d. from an underlying unknown data distribution , one often learns the model by minimizing the empirical risk: , where is the output of a possibly randomized algorithm (e.g., SGD) running on data .

Definition 1.

hardt2016train Let and be two samples of size that differ in only one example. Algorithm is -uniformly stable if .

The generalization error hardt2016train is defined as , where the expectation is taken w.r.t. the sample and randomness of . It is shown that the generalization error is bounded by the uniform stability of , i.e., hardt2016train . In other words, the more uniformly stable an algorithm is, the lower is its generalization error.

Let , and , where are the th iterates of BAGM on and , respectively. The following shows how (uniform stability) grows with .

Proposition 2.

Assume that is -Lipschitz (in other words, for any ). Suppose that Assumptions 1-5 hold, , and . For any , we have where .

Using Proposition 2, we can study how affects the growth of . Consider the first term on the RHS of the bound. Recall that . If , this term is smallest when ; otherwise, some will make this term smallest. For the term, as , the term inside is typically the smallest when , and is largest when . Thus, the first term of the bound is small when is close to , while is small when approaches . As a result, for equal to some , , and thus the generalization error, grows more slowly than those of and .

4 Experiments

In this section, we perform experiments on CIFAR-10 (Section 4.1), ImageNet (Section 4.2), and WikiText-2 (Section 4.3). All the experiments are run on an AWS p3.16 instance with 8 NVIDIA V100 GPUs. We introduce four block construction strategies:

B.1: Use a single adaptive stepsize for each parameter tensor/matrix/vector. A parameter tensor can be the kernel tensor in a convolution layer, a parameter matrix can be the weight matrix in a fully-connected layer, and a parameter vector can be a bias vector;

B.2: Use an adaptive stepsize for each output dimension of the parameter matrix/vector in a fully connected layer, and an adaptive stepsize for each output channel in the convolution layer;

B.3: Use an adaptive stepsize for each output dimension of the parameter matrix/vector in a fully connected layer, and an adaptive stepsize for each kernel in the convolution layer;

B.4: Use an adaptive stepsize for each input dimension of the parameter tensor/matrix, and an adaptive stepsize for each parameter vector.
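One way to read the four strategies for a convolution kernel is as different choices of axes over which the squared gradient is averaged. The sketch below is an interpretation under stated assumptions, not the paper's implementation: it assumes the kernel tensor layout `(out_channels, in_channels, kernel_h, kernel_w)`, takes "each kernel" in B.3 to mean each 2-D filter for an (output, input) channel pair, and uses the hypothetical names `CONV_AXES` and `block_mean_sq`.

```python
import numpy as np

# Axes averaged over for each strategy, assuming the layout
# (out_channels, in_channels, kernel_h, kernel_w). Each surviving index is one block.
CONV_AXES = {
    "B.1": (0, 1, 2, 3),  # a single stepsize for the whole tensor
    "B.2": (1, 2, 3),     # one stepsize per output channel
    "B.3": (2, 3),        # one stepsize per 2-D kernel (assumed meaning of "kernel")
    "B.4": (0, 2, 3),     # one stepsize per input channel
}

def block_mean_sq(grad, reduce_axes):
    """Per-block ||g_{G_b}||_2^2 / d_b, kept broadcastable against grad."""
    return np.mean(grad ** 2, axis=reduce_axes, keepdims=True)

rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 3, 5, 5))  # 8 output channels, 3 input channels, 5x5 kernels

num_blocks = {}
for name, axes in CONV_AXES.items():
    v = block_mean_sq(grad, axes)
    num_blocks[name] = v.size         # number of distinct adaptive stepsizes

# The blockwise statistic broadcasts back over the gradient when scaling it.
v_b2 = block_mean_sq(grad, CONV_AXES["B.2"])
scaled = grad / (np.sqrt(v_b2) + 1e-3)
```

For this tensor the strategies store 1, 8, 24, and 3 second-moment values respectively, compared with 600 for a coordinate-wise method.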

We compare the proposed BAGM (with block construction approaches B.1, B.2, B.3, B.4) with the following baselines: (i) Nesterov’s accelerated gradient (NAG) sutskever2013importance ; and (ii) Adam kingma2014adam . These two algorithms are widely applied in deep networks zaremba2014recurrent ; he2016deep ; vaswani2017attention . NAG provides a strong baseline with good generalization performance, while Adam serves as a fast counterpart with coordinate-wise adaptive stepsize.

As grid search over all hyper-parameters is very computationally expensive, we only tune the most important ones using a validation set and fix the rest. We use a constant (momentum parameter) and the exponentially increasing sequence S.3 with for BAGM. For Adam, we also fix its second moment parameter to and tune its momentum parameter. Note that with such configurations, Adam is a special case of BAGM with (i.e., weighted AdaEMA). For all the adaptive methods, we use as suggested in zaheer2018adaptive .

4.1 ResNet on CIFAR-10

We train a deep residual network from the MXNet Gluon CV model zoo on the CIFAR-10 data set. We use the 56-layer and 110-layer networks as in he2016deep . 10% of the training data are held out as a validation set. We perform grid search using the validation set for the initial stepsize and momentum parameter on ResNet56. The obtained hyperparameters are then also used on ResNet110. We follow a setup similar to that in he2016deep . Details are in Appendix C.2.

Table 1 shows the testing errors of the various methods. With a large , the testing performance of Adam matches that of NAG. This agrees with zaheer2018adaptive that a larger reduces adaptivity and improves generalization performance. It also agrees with Proposition 2 that the bound is smaller when is larger. Specifically, Adam has lower testing error than NAG on ResNet56 but higher on ResNet110. For both models, BAGM reduces the testing error over Adam for all block construction strategies used. In particular, except for 4, BAGM with every other scheme outperforms NAG.

Convergence of the training, testing, and generalization errors (absolute difference between training error and testing error) is shown in Figure 1. (To reduce clutter, we only show results of the block construction scheme BAGM-4, which gives the lowest testing error among the proposed block schemes. The figure with the full results is shown in Appendix C.2.) As can be seen, on both models, BAGM-4 converges to a lower training error rate than Adam. This agrees with Corollary 3 that blockwise adaptive methods can have faster convergence than their counterparts with element-wise adaptivity. Moreover, the generalization error of BAGM-4 is smaller than that of Adam, which agrees with Proposition 2 that blockwise adaptivity can have a slower growth of generalization error. On both models, BAGM-4 gives the smallest generalization error, while NAG has the highest generalization error on ResNet56. Overall, the proposed methods can accelerate convergence and improve generalization performance.

4.2 ImageNet Classification

In this experiment, we train a 50-layer ResNet model on ImageNet russakovsky2015imagenet . The data set has 1000 classes, 1.28M training samples, and 50,000 validation images. As the data set does not come with labels for its test set, we evaluate its generalization performance on the validation set. We use the ResNet50_v1d network from the MXNet Gluon CV model zoo. We train the FP16 (half precision) model on 8 GPUs, each of which processes 128 images in each iteration. More details are in Appendix C.3.

Performance on the validation set is shown in Table 1. As can be seen, BAGM with all the block schemes (particularly BAGM-4) achieves lower top-1 errors than Adam and NAG. As for the top-5 error, BAGM-4 obtains the lowest, which is then followed by BAGM-4. Overall, BAGM-4 has the best performance on both CIFAR-10 and ImageNet.

4.3 Word-Level Language Modeling

In this section, we train the AWD-LSTM word-level language model merity2017regularizing on the WikiText-2 (WT2) data set merity2016pointer . We use the publicly available implementation in the Gluon NLP toolkit. We perform grid search on the initial learning rate and momentum parameter as in Section 4.1, and set the weight decay to as in merity2017regularizing . More details on the setup are in Appendix C.4. As there is no convolutional layer, B.2 and B.3 are the same. Table 2 shows the testing perplexities (lower is better). As can be seen, all adaptive methods achieve lower test perplexities than NAG, and BAGM-4 obtains the best result.

5 Conclusion

In this paper, we proposed adapting the stepsize for each parameter block, instead of for each individual parameter as in Adam and RMSprop. Convergence and uniform stability analyses show that it can have faster convergence and lower generalization error than its counterpart with coordinate-wise adaptive stepsize. Experiments on image classification and language modeling confirm these theoretical results.

Appendix A Online Convex Learning

In online learning, the learner picks a prediction at round , and then suffers a loss . The goal of the learner is to choose and achieve a low regret w.r.t. an optimal predictor in hindsight. The regret (over rounds) is defined as

$$R(T) \equiv \sum_{t=1}^{T} f_t(\theta_t) - \inf_{\theta} \sum_{t=1}^{T} f_t(\theta). \tag{9}$$

A.1 Proposed Algorithm

The proposed procedure, which will be called blockwise adaptive gradient (BAG), is shown in Algorithm 2.

Remark 1.

When (i.e., each block has only one coordinate), Algorithm 2 reduces to Adagrad. When (i.e., all coordinates are grouped together), Algorithm 2 produces the update: with a global adaptive learning rate.
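The reduction described in Remark 1 can be checked with a small sketch. Since Algorithm 2 is not reproduced here, the code assumes that BAG applies updates (6) and (7) with a constant stepsize and no momentum; `bag_step` and `adagrad_step` are illustrative names, and the random vectors stand in for subgradients.

```python
import numpy as np

def bag_step(theta, grad, v, blocks, eta=0.1, eps=1e-8):
    """One blockwise step following (6)-(7): a single accumulator v[b] per block."""
    theta, v = theta.copy(), v.copy()
    for b, idx in enumerate(blocks):
        g_b = grad[idx]
        v[b] += np.sum(g_b ** 2) / len(idx)              # eq. (6)
        theta[idx] -= eta * g_b / (np.sqrt(v[b]) + eps)  # eq. (7)
    return theta, v

def adagrad_step(theta, grad, v, eta=0.1, eps=1e-8):
    """Coordinate-wise Adagrad step with the same eps placement."""
    v = v + grad ** 2
    return theta - eta * grad / (np.sqrt(v) + eps), v

rng = np.random.default_rng(0)
d = 6
singletons = [np.array([i]) for i in range(d)]  # every block holds one coordinate
theta_a, v_a = np.zeros(d), np.zeros(d)
theta_b, v_b = np.zeros(d), np.zeros(d)
for _ in range(5):
    g = rng.normal(size=d)
    theta_a, v_a = adagrad_step(theta_a, g, v_a)
    theta_b, v_b = bag_step(theta_b, g, v_b, singletons)

# A single block: all coordinates share one global adaptive learning rate.
theta_g, v_g = bag_step(np.zeros(d), g, np.zeros(1), [np.arange(d)])
```

With singleton blocks the two trajectories coincide, and with a single block the step `theta_g` is a scalar multiple of the gradient `g`.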

A.2 Regret Analysis

First, we make the following assumptions.

Assumption 7.

Each in (9) is convex but possibly nonsmooth. There exists a subgradient such that for all .

Assumption 8.

Each parameter block is in a ball of the corresponding optimal block throughout the iterations. In other words, for all , where is the subvector of for block .

Theorem 2.

Suppose that Assumptions 7 and 8 hold. Then,

$$R(T) \leq \sum_{b=1}^{B} \left[ \frac{1}{2\eta\sqrt{d_b}} D_b^2 + \eta \sqrt{d_b} \right] \|g_{1:T,\mathcal{G}_b}\|_2. \tag{10}$$

Remark 2.

When , by setting for all , where is some constant such that , the regret bound reduces to that of Adagrad in Theorem 5 of [8].

By Jensen’s inequality, the last term of (10) is minimized when . However, the comparison with Adagrad is indeterminate in the first term due to the constant .

In the following, we provide an example showing that when gradient magnitudes for elements in the same block have the same upper bound, a blockwise adaptive learning rate can lead to lower regret than a coordinate-wise adaptive learning rate (as in Adagrad). This indicates that blockwise adaptive methods can potentially be beneficial in training deep networks, as a network's architecture can be naturally divided into blocks, and parameters in the same block are likely to have gradients with similar magnitudes.

Let be the hinge loss for a linear model:

$$f_t(\theta_t) = \max(0, 1 - y_t \langle \theta_t, x_t \rangle), \tag{11}$$

where is the label and is the feature vector. Assume that input is partitioned into blocks. For each in input block , with probability , for some given , and otherwise. Then, , and the expected gradient magnitudes for elements in the same input block have the same upper bound. Taking expectation of the gradient terms in (10), we have, for all ’s,

 E[∥g1:T,Gb∥2]≤ ⎷∑