NAMSG: An Efficient Method For Training Neural Networks

by Yushu Chen, et al.
Tsinghua University
NetEase, Inc.

We introduce NAMSG, an adaptive first-order algorithm for training neural networks. The method is efficient in computation and memory, and straightforward to implement. It computes the gradients at configurable remote observation points, in order to expedite convergence in the stochastic setting by adjusting the step size for directions with different curvatures. It also scales the updating vector elementwise by a nonincreasing preconditioner, to take advantage of AMSGRAD. We analyze the convergence properties for both convex and nonconvex problems by modeling the training process as a dynamic system, and provide a guideline to select the observation distance without grid search. We also propose a data-dependent regret bound, which guarantees convergence in the convex setting. Experiments demonstrate that NAMSG works well in practice and compares favorably to popular adaptive methods such as ADAM, NADAM, and AMSGRAD.





1 Introduction and related work

Training deep neural networks [Collobert et al., 2011, Hinton et al., 2012, Amodei et al., 2016, He et al., 2016] with large datasets costs a huge amount of time and computational resources. Efficient optimization methods are urgently required to accelerate the training process.

First-order optimization methods [Robbins and Monro, 1951, Polyak, 1964, Bottou, 2010, Sutskever et al., 2013, Kingma and Ba, 2015] are currently the most popular for training neural networks. These methods are easy to implement, since only first-order gradients are required as input. Besides, they incur low computation overheads beyond computing gradients, which has the same computational complexity as just evaluating the function. Compared with second-order methods [Nocedal, 1980, Martens, 2010, Byrd et al., 2016], they are more effective at handling gradient noise. Moreover, the noise induced by the varying minibatches may even help to escape saddle points [Ge et al., 2015].

Sutskever et al. [2013] show that momentum is crucial to improving the performance of SGD. Momentum methods, such as HB [Polyak, 1964], can amplify steps in low-curvature eigen-directions of the Hessian through accumulation, although careful tuning is required to ensure convergence along the high-curvature directions. Sutskever et al. [2013] also rewrite Nesterov's Accelerated Gradient (NAG) [Nesterov, 1983] in a momentum form, and show its performance improvement over HB. The method computes the gradient at an observation point ahead of the current point along the last updating direction. They illustrate that NAG suppresses the step along high-curvature eigen-directions in order to prevent oscillations. However, all these approaches are approximations of their original forms derived for exact gradients, without a full study of gradient noise. Kidambi et al. [2018] show the insufficiency of HB and NAG in stochastic optimization, especially for small minibatches.

Among the variants of SGD, adaptive methods that scale the gradient elementwise by some form of averaging of the past gradients are particularly successful. ADAGRAD [Duchi et al., 2011] is the first popular method in this line. It is well-suited for sparse gradients, since it uses all the past gradients to scale the update. Nevertheless, it suffers from rapid decay of step sizes in cases of nonconvex loss functions or dense gradients. Subsequent adaptive methods, such as RMSPROP [Tieleman and Hinton, 2012], ADADELTA [Zeiler, 2012], ADAM [Kingma and Ba, 2015], and NADAM [Dozat, 2016], mitigate this problem by using exponential moving averages of squared past gradients. However, Reddi et al. [2018] show that ADAM does not converge to optimal solutions in some convex problems, and the analysis extends to RMSPROP, ADADELTA, and NADAM. They propose AMSGRAD, which fixes the problem and shows improvements in experiments.

In this paper, we propose an efficient method for training neural networks (NAMSG), which only requires first-order gradients. The name NAMSG is derived from combining the advantages of NAG and AMSGRAD. The method computes the stochastic gradients at observation points ahead of the current parameters along the last updating direction, which is similar to Nesterov’s acceleration. Nevertheless, instead of approximating NAG for exact gradients, it expedites convergence in the stochastic setting through adjusting the learning rates for eigen-directions with different curvatures, with configurable observation distance. It also scales the update vector elementwise using the nonincreasing preconditioner inherited from AMSGRAD. We analyze the convergence properties by modeling the training process as a dynamic system, reveal the benefits of remote gradient observations, and provide a guideline to select the observation distance without grid search. A regret bound of NAMSG is introduced in the convex setting, which guarantees the convergence. Finally, we present experiments to demonstrate the efficiency of NAMSG in real problems.

2 The NAMSG scheme

In this section, we present the NAMSG scheme by incorporating configurable remote gradient observations into AMSGRAD, to expedite convergence through introducing predictive information of the next update. The selection of observation distance will be further analyzed in the next section.

Before further description, we introduce the notation following Reddi et al. [2018], with slight abuse of notation. The letter t denotes the iteration number, d denotes the dimension of vectors and matrices, ϵ denotes a predefined small positive value, and S₊ᵈ denotes the set of all positive definite d × d matrices. For a vector a ∈ Rᵈ and a matrix M ∈ S₊ᵈ, we use a/M to denote M⁻¹a, diag(a) to denote a square diagonal matrix with the elements of a on the main diagonal, Mᵢ to denote the i-th row of M, and √M to denote M^{1/2}. For any vectors a, b ∈ Rᵈ, we use √a for elementwise square root, a² for elementwise square, a/b for elementwise division, and max(a, b) to denote elementwise maximum. For any vector θ ∈ Rᵈ, θⱼ denotes its j-th coordinate, where j ∈ {1, 2, …, d}. We define F ⊂ Rᵈ as the feasible set of points. Assume that F has bounded diameter D_∞, i.e. ‖x − y‖_∞ ≤ D_∞ for any x, y ∈ F, and that ‖∇f_t(x)‖_∞ ≤ G_∞ for all x ∈ F. The projection operation is defined as Π_{F,A}(y) = argmin_{x∈F} ‖A^{1/2}(x − y)‖ for A ∈ S₊ᵈ and y ∈ Rᵈ.

In the context of machine learning, we consider the minimization problem of a stochastic function,

    min_{x∈F} E_s[f(x, s)],   (1)

where x is a d-dimensional vector consisting of the parameters of the model, and s is a random datum consisting of an input-output pair. Since the distribution of s is generally unavailable, the optimization problem (1) is approximated by minimizing the empirical risk on the training set {s_1, …, s_N}, as

    min_{x∈F} (1/N) Σ_{n=1}^{N} f(x, s_n).   (2)
In order to save computation and avoid overfitting, it is common to estimate the objective function and its gradient with a minibatch of training data, as

    f_t(x) = (1/b) Σ_{s∈B_t} f(x, s),   g_t = ∇f_t(x),   (3)

where the minibatch B_t ⊂ {s_1, …, s_N}, and b is the size of B_t.
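For concreteness, the minibatch estimate above can be sketched as follows. This is a toy least-squares loss, and the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def minibatch_grad(x, data, batch_idx):
    """Estimate the loss and gradient of f(x, s) = 0.5*(s_in @ x - s_out)^2
    averaged over a minibatch B_t, as in the text."""
    s_in, s_out = data[0][batch_idx], data[1][batch_idx]
    residual = s_in @ x - s_out                 # per-sample prediction error
    loss = 0.5 * np.mean(residual ** 2)         # f_t(x): average over the batch
    grad = s_in.T @ residual / len(batch_idx)   # g_t = grad of f_t at x
    return loss, grad

rng = np.random.default_rng(0)
inputs, outputs = rng.normal(size=(64, 5)), rng.normal(size=64)
x = np.zeros(5)
batch = rng.choice(64, size=8, replace=False)   # minibatch B_t of size b = 8
loss, grad = minibatch_grad(x, (inputs, outputs), batch)
```

Each iteration draws a fresh minibatch, so `grad` is a noisy estimate of the full-batch gradient; this gradient noise is central to the analysis in Section 3.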

The AMSGRAD update [Reddi et al., 2018] can be written as

    m_t = β_1 m_{t-1} + (1 − β_1) g_t,
    v_t = β_2 v_{t-1} + (1 − β_2) g_t²,
    v̂_t = max(v̂_{t-1}, v_t),
    x_{t+1} = Π_{F, √v̂_t}(x_t − α_t m_t / √v̂_t),   (4)

where the step size α_t and the coefficients β_1, β_2 ∈ [0, 1) are configurable, m_0 = 0, v_0 = 0, and v̂_0 = 0.
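A minimal numpy sketch of the AMSGRAD update above, with the projection omitted for brevity; the hyper-parameter values are common defaults, not taken from the paper:

```python
import numpy as np

def amsgrad_step(x, g, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGRAD update, without the projection for simplicity."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * g           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    v_hat = np.maximum(v_hat, v)              # nonincreasing-step preconditioner
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, (m, v, v_hat)

# minimize f(x) = 0.5*||x||^2, whose gradient at x is simply x
x = np.ones(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
for _ in range(2000):
    x, state = amsgrad_step(x, x, state, alpha=0.01)
```

The elementwise `np.maximum` is what distinguishes AMSGRAD from ADAM: the effective step size along each coordinate can only shrink over time.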

Since the updating directions are partially maintained in momentum methods, gradients computed at observation points, which lie ahead of the current point along the last updating direction, contain predictive information about the forthcoming update. The remote observation points are defined as

    x_t^o = x_t + η_t d_{t-1},   (5)

where d_{t-1} = x_t − x_{t-1} is the updating vector, and η_t ≥ 0. The observation distance η_t can be configured to accommodate gradient noise, instead of being fixed by the momentum coefficient as in NAG [Sutskever et al., 2013].

By computing the gradient at the observation point x_t^o, and substituting the current gradient g_t with the observation gradient g_t^o = ∇f_t(x_t^o) in update (4), we obtain the original form of the NAMSG method, as

    g_t^o = ∇f_t(x_t + η_t d_{t-1}),
    m_t = β_1 m_{t-1} + (1 − β_1) g_t^o,
    v_t = β_2 v_{t-1} + (1 − β_2) (g_t^o)²,
    v̂_t = max(v̂_{t-1}, v_t),
    x_{t+1} = Π_{F, √v̂_t}(x_t − α_t m_t / √v̂_t),   (6)

where the step size α_t, the observation distance η_t, and the coefficients β_1, β_2 ∈ [0, 1) are configurable, m_0 = 0, v_0 = 0, and v̂_0 = 0.

Both x_t and x_t^o are required to update in (6). In order to make the method more efficient, we simplify the update by approximation. We ignore the projection, considering that F provides a theoretical bound whose boundary is generally far away from the parameters. Assuming that β_2 is close to 1, we neglect the difference between v̂_{t-1} and v̂_t. We also assume that the coefficients α_t, β_{1t}, and η_t change very slowly between adjacent iterations. Then (6) is rewritten as Algorithm 1, which is named NAMSG. (For convenience in the convergence analysis, we use x_t to denote the observation parameter vector in Algorithm 1. Good default constant settings for the step size, the momentum coefficients, and the observation distance are used for the tested machine learning problems.) Compared with AMSGRAD, NAMSG requires low computation overheads, namely a scalar-vector multiplication and a vector addition per iteration, which are much cheaper than the gradient estimation. Almost no extra memory is needed if the vector operations are run in pipelines. In most cases, especially when weight decay is applied for regularization, which limits the norm of the parameter vectors, the projection can also be omitted in implementation to save computation.

Input: initial parameter vector x_1, step sizes {α_t}, coefficients {β_{1t}}, β_2, observation distances {η_t}, iteration number T
Output: parameter vector x_{T+1}

1:  Set m_0 = 0, v_0 = 0, and v̂_0 = 0.
2:  for t = 1 to T do
3:     g_t = ∇f_t(x_t).
4:     m_t = β_{1t} m_{t-1} + (1 − β_{1t}) g_t.
5:     v_t = β_2 v_{t-1} + (1 − β_2) g_t².
6:     v̂_t = max(v̂_{t-1}, v_t).
7:     u_t = ((1 + η_t) m_t − η_t m_{t-1}) / √v̂_t, where η_t is the observation distance.
8:     x_{t+1} = Π_{F, √v̂_t}(x_t − α_t u_t).
9:  end for
Algorithm 1: NAMSG
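A minimal numpy sketch of one NAMSG iteration as described above. It is a sketch under the approximations stated in the text (projection omitted, v̂ treated as slowly varying); the combined term (1 + η)m_t − η m_{t-1} folds the remote observation into a single update, and the hyper-parameter values are illustrative:

```python
import numpy as np

def namsg_step(x, grad_fn, state, alpha=0.01, beta1=0.9, beta2=0.999,
               eta=0.1, eps=1e-8):
    """One NAMSG iteration (x plays the role of the observation parameter
    vector; the projection is omitted)."""
    m_prev, v, v_hat = state
    g = grad_fn(x)                            # gradient at the observation point
    m = beta1 * m_prev + (1 - beta1) * g      # momentum
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)              # AMSGRAD preconditioner
    u = (1 + eta) * m - eta * m_prev          # remote-observation correction
    x = x - alpha * u / (np.sqrt(v_hat) + eps)
    return x, (m, v, v_hat)

# minimize f(x) = 0.5*||x||^2, whose gradient at x is x
x = np.ones(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
for _ in range(2000):
    x, state = namsg_step(x, lambda z: z, state)
```

Setting `eta=0` recovers plain AMSGRAD, which makes the per-iteration overhead of NAMSG visible: one extra scalar-vector multiplication and one vector addition.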

3 An analysis on the effect of remote gradient observations

In Algorithm 1, the observation distance is configurable to accelerate convergence. However, it is costly to select it by grid search. In this section we analyze the convergence rate in a local stochastic quadratic optimization setting by investigating the optimizing process as a dynamic system, and reveal the effect of remote gradient observation for both convex and non-convex problems. Based on the analysis, we provide a practical guideline to set the observation distance without grid search.

The problem (1) can be approximated locally as a stochastic quadratic optimization problem, as

    min_{x∈F_L} φ(x) = (1/2) xᵀ A x − bᵀ x,   (7)

where F_L is a local set of feasible parameter points. In the problem, the gradient observation is noisy, as ĝ(x) = ∇φ(x) + ξ, where ξ is the gradient noise.

We consider the original form (6) of NAMSG, and ignore the projections for simplicity. Since v̂_t varies slowly when t is large, we can ignore the change of v̂_t between recent iterations. The operation of dividing the update by √v̂_t can be approximated by solving a preconditioned problem, as

    min_z φ(P⁻¹ z),   (8)

where z = P x and P = diag(v̂_t^{1/4}). Define the preconditioned Hessian Â = P⁻¹ A P⁻¹, which is supposed to have an improved condition number compared with A, in the convex setting.
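The equivalence between dividing the update by √v̂ and taking a plain gradient step on a preconditioned problem can be checked numerically. The sketch below assumes the preconditioner P = diag(v̂^{1/4}), as in the text:

```python
import numpy as np

# Check: a gradient step on the preconditioned variable z = P x, with
# P = diag(v_hat ** 0.25), matches dividing the raw update by sqrt(v_hat).
rng = np.random.default_rng(1)
d, alpha = 4, 0.1
v_hat = rng.uniform(0.5, 2.0, size=d)
P = v_hat ** 0.25                            # diagonal preconditioner
x = rng.normal(size=d)
A = np.diag(rng.uniform(1.0, 3.0, size=d))   # toy quadratic Hessian

g = A @ x                                    # gradient of 0.5 * x^T A x at x
# step in z-space: z' = z - alpha * grad_z, where grad_z = P^{-1} g
z_next = P * x - alpha * (g / P)
x_from_z = z_next / P                        # map back to x-space
x_direct = x - alpha * g / np.sqrt(v_hat)    # adaptive step in x-space
```

Mapping the z-space step back through P⁻¹ divides the gradient by P² = √v̂, which is exactly the adaptive scaling of update (4).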

Then, we investigate the optimization process by modeling it as a dynamic system. Solving the quadratic problem (7) by NAMSG is equal to solving the preconditioned problem (8) by a momentum method with remote gradient observations, as

    m_t = β_1 m_{t-1} + (1 − β_1) ∇φ̂(z_t + η (z_t − z_{t-1})),
    z_{t+1} = z_t − α m_t,   (9)

where the preconditioned stochastic function φ̂(z) = φ(P⁻¹z), the initial momentum m_0 = 0, and the coefficients α, β_1, and η are considered as constants.

We use q to denote a unit eigenvector of the Hessian Â, and the corresponding eigenvalue is λ. We define the coefficients p_t = qᵀ(z_t − z*) and p_t^o = qᵀ(z_t + η(z_t − z_{t-1}) − z*), where z* is the optimum. According to (9), the coefficients are updated as

    p_{t+1} = (1 + β_1 − α(1 − β_1)λ(1 + η)(1 + ε_t)) p_t − (β_1 − α(1 − β_1)λη(1 + ε_t)) p_{t-1},   (10)

where the gradient error coefficient ε_t = qᵀξ_t / (λ p_t^o).

We further rewrite the update (10) into a dynamic system consisting of the series {(p_{t+1}, p_t)}, as

    (p_{t+1}, p_t)ᵀ = T (p_t, p_{t-1})ᵀ,
    T = [ a  −b ; 1  0 ],
    a = 1 + β_1 − α(1 − β_1)λ(1 + η)(1 + ε_t),
    b = β_1 − α(1 − β_1)λη(1 + ε_t).   (11)

We assume that ε_t obeys the normal distribution with mean 0 and standard deviation σ, which we call the noise level. Then the eigenvalues of update (11) are

    ν_{1,2} = a/2 ± √((a/2)² − b),  with ε_t = σχ,   (12)

where χ obeys the standard normal distribution.

We further define the max gain expectation as

    G(λ) = E[ log max(|ν_1|, |ν_2|) ],   (13)

which is an upper bound of the expectation of the logarithmic convergence rate in the dynamic system (11). In the sense of expectation, G(λ) < 0 guarantees convergence along an eigen-direction, while G(λ) > 0 may lead to divergence. G(λ) > 0 for λ < 0 is required to escape saddle points in nonconvex problems.
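The max gain expectation above can be estimated by Monte Carlo sampling instead of numerical integration. The recurrence coefficients below follow the reconstruction in this section (multiplicative gradient noise of level σ) and should be treated as an assumption rather than the paper's exact formulation:

```python
import numpy as np

def max_gain_expectation(lam, eta, alpha=0.1, beta1=0.9, sigma=0.0,
                         n_samples=20000, seed=0):
    """Monte Carlo estimate of G(lambda) = E[log max|nu_i|] for the 2x2
    transition matrix of the recurrence, assuming gradient noise
    lam * (1 + sigma * chi) with chi ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    chi = rng.standard_normal(n_samples)
    lam_noisy = lam * (1 + sigma * chi)
    a = 1 + beta1 - alpha * (1 - beta1) * lam_noisy * (1 + eta)  # trace term
    b = beta1 - alpha * (1 - beta1) * lam_noisy * eta            # determinant
    disc = (a / 2) ** 2 - b + 0j            # +0j so complex roots are handled
    root = np.sqrt(disc)
    nu_max = np.maximum(np.abs(a / 2 + root), np.abs(a / 2 - root))
    return float(np.mean(np.log(nu_max)))

# With no noise, a positive eigenvalue contracts (G < 0), while a small
# negative eigenvalue expands (G > 0), i.e. the iteration escapes the saddle.
g_conv = max_gain_expectation(lam=1.0, eta=0.3, sigma=0.0)
g_saddle = max_gain_expectation(lam=-0.5, eta=0.3, sigma=0.0)
```

Sweeping `lam`, `eta`, and `sigma` with such a helper reproduces qualitatively the kind of convergence/divergence map discussed around Figure 1.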

Figure 1 presents the max gain expectation for different noise levels and eigenvalues, obtained by numerical integration, where the momentum coefficient β_1 is fixed. Figure 1 (a) shows the case of high-curvature eigen-directions, in which a proper observation distance accelerates the convergence, while a large η may cause divergence. Figure 1 (b) shows the case of low-curvature eigen-directions. A positive η accelerates the convergence for relatively high noise levels, which is the common case since λ is small. Figure 1 (c) and (d) show the case of small negative eigenvalues in nonconvex problems. It is observed that momentum methods may be trapped by saddle points with small negative eigenvalues, and a large η can mitigate the problem. The results also verify that large gradient noise is helpful to escape saddle points. Figure 1 (e) shows that a moderate observation distance results in a large convergence domain, allowing convergence for high-curvature eigen-directions. Meanwhile, a large η accelerates the convergence for low-curvature eigen-directions, which is the key difficulty in training. However, the maximum eigenvalue allowed for convergence decreases rapidly as η increases, which prohibits too large an η.

Figure 1: Max gain expectation for each iteration: (a) a high-curvature eigen-direction, (b) a low-curvature eigen-direction, (c) and (d) small negative eigenvalues, (e) an overview.

The analysis provides a guideline to select the observation distance without grid search. A moderate initial value of η is suggested to avoid oscillations along the high-curvature directions. Increase η gradually to improve the performance along the low-curvature or negative-curvature directions, and then keep it constant. For example, we set a small η at the beginning, and increase it linearly to a moderate value in several epochs. The analysis may also be useful for the selection of other hyper-parameters.
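The warm-up guideline can be implemented as a simple schedule. The function below is a hypothetical illustration, and the specific start/end values are placeholders, not the paper's settings:

```python
def observation_distance(t, steps_per_epoch, eta_start=0.0, eta_end=0.3,
                         warmup_epochs=1):
    """Linear warm-up of the observation distance eta: start small to avoid
    oscillations along high-curvature directions, then grow it to help the
    low-curvature ones, and hold it constant afterwards."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if t >= warmup_steps:
        return eta_end
    return eta_start + (eta_end - eta_start) * t / warmup_steps
```

Calling this once per iteration gives the η_t sequence consumed by Algorithm 1, with no grid search over η required.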

4 Convergence Analysis

In this section, we analyze the convergence properties of NAMSG in the convex setting, and show a data-dependent regret bound.

Since the sequence of cost functions is stochastic, we evaluate the convergence property of our algorithm by the regret, which is the sum over all previous steps of the difference between the online prediction f_t(x_t) and the loss at the best fixed parameter point x* = argmin_{x∈F} Σ_{t=1}^{T} f_t(x), defined as R_T = Σ_{t=1}^{T} [f_t(x_t) − f_t(x*)]. When the regret of an algorithm satisfies R_T = o(T), the algorithm converges to the optimal parameters on average.
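For intuition, the regret can be computed directly on a toy online problem. The helper below is illustrative, not part of the paper:

```python
import numpy as np

def regret(losses_at_iterates, losses_at_best):
    """R_T = sum_t [f_t(x_t) - f_t(x*)] for the best fixed point x*."""
    return float(np.sum(np.asarray(losses_at_iterates) -
                        np.asarray(losses_at_best)))

# toy online problem: f_t(x) = (x - c_t)^2 with drifting targets c_t
c = np.array([1.0, -1.0, 1.0, -1.0])
x_star = c.mean()                              # best fixed point in hindsight: 0
iterates = np.array([2.0, 1.5, 1.0, 0.5])      # some online predictions
R_T = regret((iterates - c) ** 2, (x_star - c) ** 2)
```

An algorithm with sublinear regret makes the average per-step value R_T / T vanish as T grows, which is the sense of convergence used in Theorem 1.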

Assuming that α_t = α/√t, NAMSG ensures Γ_{t+1} = (√v̂_{t+1}/α_{t+1} − √v̂_t/α_t) ⪰ 0. The positive definiteness of Γ_{t+1} results in a nonincreasing step size and avoids the non-convergence of ADAM. Following Reddi et al. [2018], we derive the following key results for NAMSG.

Theorem 1. Let {x_t} and {v_t} be the sequences obtained from Algorithm 1, with α_t = α/√t, β_1 = β_{11}, β_{1t} ≤ β_1 for all t ∈ [T], and γ = β_1/√β_2 < 1. Assume that ‖x − y‖_∞ ≤ D_∞ for all x, y ∈ F, and ‖∇f_t(x)‖_∞ ≤ G_∞ for all t ∈ [T] and x ∈ F. For x_t generated using NAMSG (Algorithm 1), we have the following bound on the regret:


The proof is given in the supplementary materials.

By comparing with the regret bound of AMSGRAD [Reddi et al., 2018], as

    R_T ≤ (D_∞² √T)/(α(1 − β_1)) Σ_{i=1}^{d} √v̂_{T,i} + (D_∞²)/(2(1 − β_1)) Σ_{t=1}^{T} Σ_{i=1}^{d} (β_{1t} √v̂_{t,i})/α_t + (α √(1 + log T))/((1 − β_1)²(1 − γ)√(1 − β_2)) Σ_{i=1}^{d} ‖g_{1:T,i}‖_2,

we find that the regret bounds of the two methods have a similar form. However, when β_1 is close to 1, which is the typical situation, NAMSG has lower coefficients on all of the three terms.

From Theorem 1, we can immediately obtain the following corollary.

Corollary 1. Suppose β_{1t} = β_1 λ^{t−1} with λ ∈ (0, 1). Then we have


The bound in Corollary 1 is considerably better than the O(√(dT)) regret of SGD when Σ_{i=1}^{d} √v̂_{T,i} ≪ √d and Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 ≪ √(dT) [Duchi et al., 2011]. It should be noted that although the proof requires a decreasing schedule of α_t and β_{1t} to ensure convergence, we typically use constant α_t and β_{1t} in practice, which improves the convergence speed in the experiments.

5 Experiments

In this section, we present experiments to evaluate the performance of NAMSG, compared with SGD with momentum [Polyak, 1964] and popular adaptive stochastic optimization methods, such as ADAM [Kingma and Ba, 2015], NADAM [Dozat, 2016], and AMSGRAD [Reddi et al., 2018]. We study logistic regression and neural networks for multiclass classification, representing convex and nonconvex settings, respectively. The experiments are carried out with MXNET [Chen et al., 2015].

5.1 Experiments on MNIST

We compare the performance of SGD, ADAM, NADAM, AMSGRAD, and NAMSG for training logistic regression and a neural network on the MNIST dataset [LeCun et al., 1998]. The dataset consists of 60k training images and 10k testing images in 10 classes. The image size is 28 × 28.

Logistic regression: In the experiment, the hyper-parameters for all the methods are set as follows. The coefficient β_2 and the minibatch size are fixed, while the step size α and the coefficient β_1 are chosen by grid search (see supplementary materials), and the best results in training are reported. In NAMSG, the observation distance η is set without grid search: according to the guideline in Section 3, it increases linearly in the first epoch. We report the train and test results in Figure 2, which are the average of 10 runs. It is observed that NAMSG performs the best with respect to train loss and accuracy. The test loss and accuracy are roughly consistent with the train loss in the initial epochs, after which the test loss increases due to overfitting. The experiment shows that NAMSG achieves fast convergence in the convex setting.

Neural networks: In the experiment, we train a simple convolutional neural network (CNN) for the multiclass classification problem on MNIST. The architecture has two convolutional layers, with 20 and 50 outputs. Each convolutional layer is followed by Batch Normalization (BN) [Ioffe and Szegedy, 2015] and max pooling. The network ends with a 500-way fully-connected layer with BN and ReLU [Nair and Hinton, 2010], a 10-way fully-connected layer, and softmax. The hyper-parameters are set in a way similar to the previous experiment. The results are also reported in Figure 2, averaged over 10 runs. We can see that NAMSG has the lowest train loss, which translates to good generalization performance before overfitting. The performance of AMSGRAD is close to NAMSG, and better than the other methods. ADAM and NADAM are faster than SGD in the initial epochs, but they require lower learning rates in the final epochs to further reduce the train loss. The experiment shows that NAMSG is also efficient in non-convex problems.

Figure 2: Performance of SGD, ADAM, NADAM, AMSGRAD, and NAMSG on MNIST. The top row shows the performance in logistic regression, and the bottom row compares the results in CNN.

5.2 Experiments on CIFAR-10

In the experiment, we train Resnet-20 [He et al., 2016] on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50k training images and 10k testing images in 10 classes. The image size is 32 × 32.

The architecture of the network is as follows. In training, the network inputs are images randomly cropped from the original images or their horizontal flips, to save computation. The inputs are subtracted by the global mean and divided by the standard deviation. The first layer is 3 × 3 convolutions. Then we use a stack of 18 layers with 3 × 3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 6 layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. A shortcut connection is added to each pair of 3 × 3 filters. The subsampling is performed by convolutions with a stride of 2. Batch normalization is adopted right after each convolution and before the ReLU activation. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. In testing, the original 32 × 32 images are used as inputs.

We train Resnet-20 on CIFAR-10 using SGD, ADAM, NADAM, AMSGRAD, and NAMSG. The training for each network runs for 75 epochs. The hyper-parameters are selected in a way similar to the previous experiments, except that we divide the constant step size by 10 at a fixed iteration late in training. A weight decay of 0.001 is used for regularization. Since the grid search is time-consuming, we run only 30 epochs for each group of hyper-parameters in grid search.

Figure 3 shows the average results of 10 runs. NAMSG converges the fastest in training, requiring considerably fewer iterations to reach the same level of train loss compared with ADAM. It also has the best performance in testing before the step size drops. The test accuracy increases fast after the step size drops, and then stagnates due to overfitting. ADAM and NADAM generalize slightly better than AMSGRAD and NAMSG, which may be caused by better exploration of the parameter space, since they converge more slowly. SGD has the highest train loss, but performs the best in testing.

The results verify that the generalization capability of adaptive methods is worse than that of SGD in some models [Wilson et al., 2017]. Methods to address this issue include, e.g., data augmentation, increasing the regularization parameter or step sizes, and switching to SGD [Keskar and Socher, 2017, Luo et al., 2019]. Besides trying a relatively large step size, we define two strategies to switch to SGD, named SWNTS and NAMSB (see supplementary materials for details).

Figure 3 also shows the performance of the strategies for NAMSG to promote generalization, averaged over 10 runs. In the figure, NAMSG1 denotes NAMSG with a relatively large step size. It progresses faster than ADAM, especially in testing, and finally achieves slightly higher test accuracy than SGD. SWNTS also converges faster than ADAM, and generalizes better than SGD. SWNTS1 denotes SWNTS with the step size of NAMSG1. It achieves the best mean generalization accuracy, higher than the baseline of He et al. [2016] obtained through data augmentation. The generalization performance of NAMSB is very close to that of SWNTS1.

Figure 3: Performance of SGD, ADAM, NADAM, AMSGRAD, and NAMSG for Resnet-20 on CIFAR-10. The top row compares the best results for training, and the bottom row shows the results of the strategies for NAMSG to promote generalization.

The experiments show that in the machine learning problems tested (see supplementary materials for more results), NAMSG converges faster than other popular adaptive methods, such as ADAM, NADAM, and AMSGRAD. Besides, in grid search we observe that NAMSG is faster than AMSGRAD for almost all the hyper-parameter settings, and faster than ADAM and NADAM for most of the settings. The acceleration is achieved with low computational overheads and almost no extra memory. Even if the generalization gap exists in some models, it can be mitigated by simple strategies. For example, we can use a coarse grid search instead of a fine one, and select a relatively large step size. Switching to SGD also achieves good generalization.

6 Conclusions and discussions

We present the NAMSG method, which computes the gradients at configurable remote observation points, and scales the update vector elementwise by a nonincreasing preconditioner. It is efficient in computation and memory, and is straightforward to implement. A data-dependent regret bound is proposed to guarantee convergence in the convex setting. Numerical experiments demonstrate that NAMSG converges faster than ADAM, NADAM, and AMSGRAD for the tested problems. The configurable remote gradient observations can also be applied to expedite convergence in cases where other adaptive methods or SGD are more suitable than AMSGRAD. The analysis of the optimizing process as a dynamic system may be useful for studies on hyper-parameter selection.