1 Introduction and related work
Training deep neural networks [Collobert et al., 2011, Hinton et al., 2012, Amodei et al., 2016, He et al., 2016] with large datasets costs a huge amount of time and computational resources. Efficient optimization methods are urgently required to accelerate the training process.
First-order optimization methods [Robbins and Monro, 1951, Polyak, 1964, Bottou, 2010, Sutskever et al., 2013, Kingma and Ba, 2015] are currently the most popular for training neural networks. These methods are easy to implement since they require only first-order gradients as input. Besides, they incur little computational overhead beyond computing gradients, which has the same computational complexity as evaluating the function. Compared with second-order methods [Nocedal, 1980, Martens, 2010, Byrd et al., 2016], they are more effective at handling gradient noise. Moreover, the noise induced by the varying minibatches may even help to escape saddle points [Ge et al., 2015].
Sutskever et al. [2013] show that momentum is crucial to improving the performance of SGD. Momentum methods, such as HB [Polyak, 1964], can amplify steps in low-curvature eigendirections of the Hessian through accumulation, although careful tuning is required to ensure fine convergence along the high-curvature directions. Sutskever et al. [2013] also rewrite Nesterov's Accelerated Gradient (NAG) [Nesterov, 1983] in a momentum form, and show a performance improvement over HB. The method computes the gradient at an observation point ahead of the current point along the last updating direction. They illustrate that NAG suppresses the step along high-curvature eigendirections in order to prevent oscillations. However, all these approaches are approximations of their original forms derived for exact gradients, without a full study of gradient noise. Kidambi et al. [2018] show the insufficiency of HB and NAG in stochastic optimization, especially for small minibatches.
Among variants of SGD methods, adaptive methods that scale the gradient elementwise by some form of averaging of the past gradients are particularly successful. ADAGRAD [Duchi et al., 2011] is the first popular method in this line. It is well-suited for sparse gradients since it uses all the past gradients to scale the update. Nevertheless, it suffers from rapid decay of step sizes in cases of nonconvex loss functions or dense gradients. Subsequent adaptive methods, such as RMSPROP [Tieleman and Hinton, 2012], ADADELTA [Zeiler, 2012], ADAM [Kingma and Ba, 2015], and NADAM [Dozat, 2016], mitigate this problem by using the exponential moving averages of squared past gradients. However, Reddi et al. [2018] show that ADAM does not converge to optimal solutions in some convex problems, and the analysis extends to RMSPROP, ADADELTA, and NADAM. They propose AMSGRAD, which fixes the problem and shows improvements in experiments.
In this paper, we propose an efficient method for training neural networks (NAMSG), which only requires first-order gradients. The name NAMSG is derived from combining the advantages of NAG and AMSGRAD. The method computes the stochastic gradients at observation points ahead of the current parameters along the last updating direction, which is similar to Nesterov's acceleration. Nevertheless, instead of approximating NAG for exact gradients, it expedites convergence in the stochastic setting by adjusting the learning rates for eigendirections with different curvatures, with a configurable observation distance. It also scales the update vector elementwise using the nonincreasing preconditioner inherited from AMSGRAD. We analyze the convergence properties by modeling the training process as a dynamic system, reveal the benefits of remote gradient observations, and provide a guideline to select the observation distance without grid search. A regret bound of NAMSG is introduced in the convex setting, which guarantees the convergence. Finally, we present experiments to demonstrate the efficiency of NAMSG in real problems.
2 The NAMSG scheme
In this section, we present the NAMSG scheme by incorporating configurable remote gradient observations into AMSGRAD, to expedite convergence through introducing predictive information of the next update. The selection of observation distance will be further analyzed in the next section.
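Before further description, we introduce the notation following Reddi et al. [2018], with slight abuse of notation. The letter $t$ denotes the iteration number, $d$ denotes the dimension of vectors and matrices, $\epsilon$ denotes a predefined small positive value, and $\mathcal{S}_+^d$ denotes the set of all positive definite $d \times d$ matrices. For a vector $a \in \mathbb{R}^d$ and a matrix $M \in \mathbb{R}^{d \times d}$, we use $a/M$ to denote $M^{-1}a$, $\mathrm{diag}(a)$ to denote a square diagonal matrix with the elements of $a$ on the main diagonal, $M_i$ to denote the $i$-th row of $M$, and $\sqrt{M}$ to denote $M^{1/2}$. For any vectors $x, y \in \mathbb{R}^d$, we use $\sqrt{x}$ for elementwise square root, $x^2$ for elementwise square, $x/y$ for elementwise division, and $\max(x, y)$ to denote elementwise maximum. For any vector $\theta_i \in \mathbb{R}^d$, $\theta_{i,j}$ denotes its $j$-th coordinate where $j \in \{1, \dots, d\}$. We define $\mathcal{F} \subset \mathbb{R}^d$ as the feasible set of points. Assume that $\mathcal{F}$ has bounded diameter $D_\infty$, i.e. $\|x - y\|_\infty \le D_\infty$ for any $x, y \in \mathcal{F}$, and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $x \in \mathcal{F}$. The projection operation is defined as $\Pi_{\mathcal{F},A}(y) = \arg\min_{x \in \mathcal{F}} \|A^{1/2}(x - y)\|_2$ for $A \in \mathcal{S}_+^d$ and $y \in \mathbb{R}^d$.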
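In the context of machine learning, we consider the minimization problem of a stochastic function,
$$\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_{\xi}\left[\ell(x, \xi)\right], \tag{1}$$
where $x$ is a $d$-dimensional vector consisting of the parameters of the model, and $\xi$ is a random datum consisting of an input-output pair. Since the distribution of $\xi$ is generally unavailable, the optimization problem (1) is approximated by minimizing the empirical risk on the training set $\{\xi_1, \xi_2, \dots, \xi_N\}$, as
$$\min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^{N} \ell(x, \xi_i). \tag{2}$$
In order to save computation and avoid overfitting, it is common to estimate the objective function and its gradient with a minibatch of training data, as
$$f_t(x) = \frac{1}{b} \sum_{i \in \mathcal{S}_t} \ell(x, \xi_i), \qquad \nabla f_t(x) = \frac{1}{b} \sum_{i \in \mathcal{S}_t} \nabla \ell(x, \xi_i), \tag{3}$$
where the minibatch $\mathcal{S}_t \subset \{1, 2, \dots, N\}$, and $b = |\mathcal{S}_t|$ is the size of $\mathcal{S}_t$.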
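The minibatch estimate in (3) can be illustrated with a small NumPy sketch. The least-squares loss, the synthetic data, and the names `loss_grad` and `minibatch_grad` are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def loss_grad(x, inputs, targets):
    """Gradient of the mean loss 0.5 * mean((inputs @ x - targets) ** 2) w.r.t. x."""
    residual = inputs @ x - targets
    return inputs.T @ residual / len(targets)

def minibatch_grad(x, inputs, targets, batch_idx):
    """Estimate the gradient using only the samples indexed by batch_idx."""
    return loss_grad(x, inputs[batch_idx], targets[batch_idx])

# Synthetic training set (hypothetical data for illustration only).
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 5))
targets = inputs @ np.ones(5) + 0.1 * rng.normal(size=1000)
x = np.zeros(5)

full_grad = loss_grad(x, inputs, targets)            # empirical-risk gradient
batch_idx = rng.choice(1000, size=64, replace=False)  # one random minibatch
estimate = minibatch_grad(x, inputs, targets, batch_idx)
```

The estimate is a noisy version of `full_grad`; the difference between the two plays the role of the gradient noise discussed in Section 3.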
The AMSGRAD update [Reddi et al., 2018] can be written as
(4) 
where , , , and are configurable coefficients, , , and .
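For orientation, a single AMSGRAD step as it is commonly implemented can be sketched as follows. This is a minimal sketch rather than a transcription of (4): the projection is omitted, the small constant `eps` is added to the denominator, and the default coefficient values are common choices, all of which are assumptions.

```python
import numpy as np

def amsgrad_step(x, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update without projection (eps placement is an assumption)."""
    m = beta1 * state["m"] + (1 - beta1) * grad       # first-moment average
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second-moment average
    v_hat = np.maximum(state["v_hat"], v)             # nonincreasing step size
    x_new = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x_new, {"m": m, "v": v, "v_hat": v_hat}

x0 = np.array([1.0, -2.0])
state0 = {"m": np.zeros(2), "v": np.zeros(2), "v_hat": np.zeros(2)}
x1, state1 = amsgrad_step(x0, np.array([0.5, -1.0]), state0)
x2, state2 = amsgrad_step(x1, np.zeros(2), state1)
```

The elementwise maximum is what distinguishes this update from ADAM: `v_hat` never shrinks, even when later gradients are small.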
Since the updating directions are partially maintained in momentum methods, gradients computed at observation points, which lie ahead of the current point along the last updating direction, contain the predictive information of the forthcoming update. The remote observation points are defined as
(5) 
where is the updating vector, and . The observation distance can be configured to accommodate gradient noise, instead of in NAG [Sutskever et al., 2013].
By computing the gradient at the observation point , and substituting the current gradient with the observation gradient in update (4), we obtain the original form of NAMSG method, as
(6) 
where , , , , and are configurable coefficients, , , and .
Both and are required to update in (6). In order to make the method more efficient, we simplify the update by approximation. We ignore the projection and substitute by considering that provides a theoretical bound, whose boundary is generally far away from the parameters. Assuming that is close to 1, we neglect the difference between and . We also assume that the coefficients , , and , change very slowly between adjacent iterations. Then (6) is rewritten as Algorithm 1, which is named NAMSG. (Footnote 1: For convenience in the convergence analysis, we use to denote the observation parameter vector instead of in Algorithm 1. Good default constant hyperparameter settings for the tested machine learning problems are , , , and .) Compared with AMSGRAD, NAMSG adds little computational overhead: one scalar-vector multiplication and one vector addition per iteration, which are much cheaper than the gradient estimation. Almost no additional memory is needed if the vector operations are pipelined. In most cases, especially when weight decay is applied for regularization, which limits the norm of the parameter vectors, the projection can also be omitted in implementation to save computation.
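The idea of evaluating the gradient at a remote observation point can be sketched in a few lines of NumPy. This is an illustration under assumptions, not Algorithm 1: the observation point is taken as the current point shifted by `eta` times the previous update, the projection is dropped, `eps` is added to the denominator, the state starts at zero, and the names `init_state` and `lookahead_step` as well as all default values are hypothetical.

```python
import numpy as np

def init_state(dim):
    """Zero-initialised optimizer state (an assumption made for this sketch)."""
    return {name: np.zeros(dim) for name in ("m", "v", "v_hat", "u")}

def lookahead_step(x, grad_fn, state, alpha=0.01, beta1=0.9, beta2=0.999,
                   eta=0.5, eps=1e-8):
    """AMSGrad-style step whose gradient is taken at a look-ahead point."""
    # Observation point: ahead of x along the last updating direction.
    x_obs = x - eta * alpha * state["u"]
    g = grad_fn(x_obs)
    m = beta1 * state["m"] + (1 - beta1) * g
    v = beta2 * state["v"] + (1 - beta2) * g ** 2
    v_hat = np.maximum(state["v_hat"], v)   # nonincreasing preconditioner
    u = m / (np.sqrt(v_hat) + eps)          # updating vector
    return x - alpha * u, {"m": m, "v": v, "v_hat": v_hat, "u": u}

# Usage on a toy quadratic 0.5 * sum(curvature * x**2) with exact gradients.
curvature = np.array([1.0, 10.0])
x = np.array([1.0, 1.0])
state = init_state(2)
for _ in range(2000):
    x, state = lookahead_step(x, lambda p: curvature * p, state)
final_norm = float(np.linalg.norm(x))
```

Setting `eta=0.0` in this sketch recovers a plain AMSGrad-style step, which makes the extra cost visible: one scalar-vector multiplication and one vector addition to form `x_obs`.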
3 An analysis on the effect of remote gradient observations
In Algorithm 1, the observation distance is configurable to accelerate convergence. However, it is costly to select it by grid search. In this section we analyze the convergence rate in a local stochastic quadratic optimization setting by investigating the optimizing process as a dynamic system, and reveal the effect of remote gradient observation for both convex and nonconvex problems. Based on the analysis, we provide a practical guideline to set the observation distance without grid search.
The problem (1) can be approximated locally as a stochastic quadratic optimization problem, as
(7) 
where is a local set of feasible parameter points. In the problem, the gradient observation is noisy as , where is the gradient noise.
We consider the original form (6) of NAMSG, and ignore the projections for simplicity. Since varies slowly when is large, we can ignore the change of between recent iterations. The operation of dividing the update by can be approximated by solving a preconditioned problem, as
(8) 
where , . Define the preconditioned Hessian , which is supposed to have improved condition number compared with , in the convex setting.
Then, we investigate the optimization process by modeling it as a dynamic system. Solving the quadratic problem (7) by NAMSG is equal to solving the preconditioned problem (8) by a momentum method with remote gradient observations, as
(9) 
where the preconditioned stochastic function , the initial momentum , the coefficients , , and are considered as constants.
We use to denote a unit eigenvector of the Hessian , and the corresponding eigenvalue is . We define the coefficients as , . According to (9), the coefficients are updated as
(10) 
where the gradient error coefficient .
We further rewrite the update (10) into a dynamic system consisting of the series , as
(11) 
We assume that obeys the normal distribution with mean 0 and standard deviation , and define the noise level , and . Then the eigenvalues of update (11) are
(12) 
where obeys the standard normal distribution.
We further define the max gain expectation as
(13) 
which is an upper bound of the expectation of the logarithmic convergence rate in the dynamic system (11). In the sense of expectation, guarantees convergence along an eigendirection, while may lead to divergence. for is required to escape saddle points in nonconvex problems.
Figure 1 presents the max gain expectation for different noise levels and eigenvalues obtained by numerical integration, where the momentum coefficient . Figure 1 (a) shows the case of high-curvature eigendirections, in which a proper observation distance () accelerates the convergence, while a large may cause divergence. Figure 1 (b) shows the case of low-curvature eigendirections. A positive accelerates the convergence for relatively high noise level , which is the common case since is small. Figure 1 (c) and (d) show the case of small negative eigenvalues in nonconvex problems. It is observed that momentum methods may be trapped by saddle points with small negative eigenvalues, and a large can mitigate the problem. The results also verify that large gradient noise is helpful to escape saddle points. Figure 1 (e) shows that a moderate observation distance (such as ) results in a large convergence domain, allowing the convergence for high-curvature eigendirections. Meanwhile, a large accelerates the convergence for low-curvature eigendirections, which is the key difficulty in training. However, the maximum eigenvalue allowed for convergence decreases rapidly as increases, which prohibits too large .
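The qualitative trade-off for high-curvature directions can be reproduced in a noise-free simplification. The sketch below is not the stochastic analysis of (11) to (13): it considers a scalar quadratic with exact gradients, a momentum update whose gradient is taken at a look-ahead point, and a dimensionless parameter `tau` (step size times eigenvalue). The parametrization, the function name `spectral_radius`, and the value 0.9 for the momentum coefficient are assumptions.

```python
import numpy as np

def spectral_radius(tau, eta, beta=0.9):
    """Per-step contraction factor of look-ahead momentum on a 1-D quadratic.

    State is (x_t, p_{t-1}), where p is the momentum scaled by the step size:
        p_t     = beta * p_{t-1} + (1 - beta) * tau * (x_t - eta * p_{t-1})
        x_{t+1} = x_t - p_t
    """
    theta = 1.0 - beta
    damp = beta - theta * tau * eta  # coefficient of p_{t-1} in p_t
    transition = np.array([[1.0 - theta * tau, -damp],
                           [theta * tau, damp]])
    return float(np.max(np.abs(np.linalg.eigvals(transition))))

for eta in (0.0, 0.5, 2.0):
    print(eta, spectral_radius(tau=10.0, eta=eta))
```

With these illustrative settings the radius is about 0.95 without look-ahead, about 0.63 for a moderate distance of 0.5, and above 1 (divergence) for a distance of 2, which mirrors the behaviour described for the high-curvature case.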
The analysis provides a guideline to select the observation distance without grid search. A moderate initial observation distance is suggested to avoid oscillations along the high-curvature directions. The distance is then increased gradually to improve the performance along the low-curvature or negative-curvature directions, and kept constant afterwards. For example, we set at the beginning, and increase it linearly to or in several epochs. The analysis may also be useful for the selection of other hyperparameters.
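A linear ramp of this kind can be written as a small helper. The function name `observation_distance` and the start, end, and ramp-length values in the usage lines are hypothetical placeholders, not the settings used in the paper.

```python
def observation_distance(step, ramp_steps, eta_start, eta_end):
    """Linearly increase the observation distance, then hold it constant."""
    if ramp_steps <= 0 or step >= ramp_steps:
        return eta_end
    frac = step / ramp_steps
    return eta_start + frac * (eta_end - eta_start)

# Hypothetical schedule: ramp from 0.1 to 0.9 over the first 100 iterations.
schedule = [observation_distance(s, 100, 0.1, 0.9) for s in (0, 50, 100, 500)]
```

In a training loop the returned value would be recomputed once per iteration and passed to the optimizer as the current observation distance.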
4 Convergence Analysis
In this section, we analyze the convergence properties of NAMSG in the convex setting, and show a data dependent regret bound.
Since the sequence of cost functions is stochastic, we evaluate the convergence property of our algorithm by regret, which is the sum of the differences between the online predictions and the best fixed-point parameter over all the previous steps, defined as . When the regret of an algorithm satisfies , the algorithm converges to the optimal parameters on average.
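As a concrete reading of this definition, the sketch below computes the regret of a sequence of predictions against a fixed comparator. The scalar quadratic losses, the helper name `regret`, and the numbers are assumptions for illustration; the comparator is supplied by the caller, and for these particular losses the best fixed point is the mean of the centers.

```python
def regret(losses, predictions, best_fixed_point):
    """Sum over steps of f_t(x_t) - f_t(x*), for a fixed comparator x*."""
    return sum(f(x) - f(best_fixed_point) for f, x in zip(losses, predictions))

# Two scalar losses f_t(x) = (x - c_t)^2 with centers 0 and 2.
centers = [0.0, 2.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in centers]
best = sum(centers) / len(centers)           # minimizer of the summed losses

regret_poor = regret(losses, [0.0, 0.0], best)  # always predicting 0
regret_best = regret(losses, [best, best], best)
average_regret = regret_poor / len(losses)
```

An algorithm converges on average in the sense used here when the average regret, computed as in the last line, tends to zero as the number of steps grows.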
Assuming that , NAMSG ensures . The positive definiteness of results in a nonincreasing step size and avoids the nonconvergence of ADAM. Following Reddi et al. [2018], we derive the following key results for NAMSG.
Theorem 1. Let and be the sequences obtained from Algorithm 1, , , , , , , for all , and . For generated using the NAMSG (Algorithm 1), we have the following bound on the regret
(14)  
The proof is given in the supplementary materials, which can be downloaded at https://github.com/rationalspark/NAMSG/blob/master/1Supplementary%20Materials.pdf.
By comparing the regret bound of AMSGRAD [Reddi et al., 2018], as
(15)  
we find that the regret bounds of the two methods have a similar form. However, when is close to 1, which is the typical situation, NAMSG has lower coefficients on all three terms.
From Theorem 1, we can immediately obtain the following corollary.
Corollary 1. Suppose , then we have
(16) 
The bound in Corollary 1 is considerably better than regret of SGD when and [Duchi et al., 2011]. It should be noted that although the proof requires a decreasing schedule of and to ensure convergence, we typically use constant and in practice, which improves the convergence speed in the experiments.
5 Experiments
In this section, we present experiments to evaluate the performance of NAMSG, compared with SGD with momentum [Polyak, 1964] and popular adaptive stochastic optimization methods, such as ADAM [Kingma and Ba, 2015], NADAM [Dozat, 2016], and AMSGRAD [Reddi et al., 2018]. We study logistic regression and neural networks for multiclass classification, representing convex and nonconvex settings, respectively. The experiments are carried out with MXNET [Chen et al., 2015].
5.1 Experiments on MNIST
We compare the performance of SGD, ADAM, NADAM, AMSGRAD, and NAMSG for training logistic regression and a neural network on the MNIST dataset [LeCun et al., 1998]. The dataset consists of 60k training images and 10k testing images in 10 classes. The image size is 28 × 28.
Logistic regression: In the experiment, the hyperparameters for all the methods are set as follows: The step size parameter , the coefficients , is chosen from , , and the minibatch size is . and are chosen by grid search (see supplementary materials), and the best results in training are reported. In NAMSG, the observation distance is set without grid search. According to the guideline in Section 3, it increases from to linearly in the first epoch. We report the train and test results in Figure 2, which are the average of 10 runs. It is observed that NAMSG performs the best with respect to train loss and accuracy. The test loss and accuracy are roughly consistent with the train loss in the initial epochs, after which they increase due to overfitting. The experiment shows that NAMSG achieves fast convergence in the convex setting.
Neural networks: In the experiment, we train a simple convolutional neural network (CNN) for the multiclass classification problem on MNIST. The architecture has two convolutional layers, with 20 and 50 outputs. Each convolutional layer is followed by Batch Normalization (BN) [Ioffe and Szegedy, 2015] and a max pooling. The network ends with a 500-way fully-connected layer with BN and ReLU [Nair and Hinton, 2010], a 10-way fully-connected layer, and softmax. The hyperparameters are set in a way similar to the previous experiment. The results are also reported in Figure 2, which are the average of 10 runs. We can see that NAMSG has the lowest train loss, which translates to good generalization performance before overfitting. The performance of AMSGRAD is close to NAMSG, and better than other methods. ADAM and NADAM are faster than SGD in the initial epochs, but they require lower learning rates in the final epochs to further reduce the train loss. The experiment shows that NAMSG is also efficient in nonconvex problems.
5.2 Experiments on CIFAR-10
In the experiment, we train ResNet-20 [He et al., 2016] on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50k training images and 10k testing images in 10 classes. The image size is 32 × 32.
The architecture of the network is as follows: In training, the network inputs are images randomly cropped from the original images or their horizontal flips to save computation. The inputs are subtracted by the global mean and divided by the standard deviation. The first layer is convolutions. Then we use a stack of 18 layers with convolutions on the feature maps of sizes respectively, with 6 layers for each feature map size. The numbers of filters are respectively. A shortcut connection is added to each pair of filters. The subsampling is performed by convolutions with a stride of 2. Batch normalization is adopted right after each convolution and before the ReLU activation. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. In testing, the original images are used as inputs.
We train ResNet-20 on CIFAR-10 using SGD, ADAM, NADAM, AMSGRAD, and NAMSG. The training for each network runs for 75 epochs. The hyperparameters are selected in a way similar to the previous experiments, except that we divide the constant step size by 10 at the iteration (in the epoch). A weight decay of 0.001 is used for regularization. Since the grid search is time consuming, we run only 30 epochs for each group of hyperparameters in grid search.
Figure 3 shows the average results of 10 runs. NAMSG converges the fastest in training, requiring considerably fewer iterations to obtain the same level of train loss compared with ADAM. It also has the best performance in testing before the step size drops. The test loss and accuracy increase fast after the step size drops, and then stagnate due to overfitting. ADAM and NADAM generalize slightly better than AMSGRAD and NAMSG, which may be caused by better exploration of the parameter space since they converge more slowly. SGD has the highest train loss, but performs the best in testing.
The results verify that the generalization capability of adaptive methods is worse than SGD in some models [Wilson et al., 2017]. Methods to address this issue include, e.g., data augmentation, increasing the regularization parameter or step sizes, and switching to SGD [Keskar and Socher, 2017, Luo et al., 2019]. Besides trying a relatively large step size, we define 2 strategies to switch to SGD, named SWNTS and NAMSB (see supplementary materials for details).
Figure 3 also shows the performance of the strategies for NAMSG to promote generalization, which are the average of 10 runs. In the figure, NAMSG1 denotes NAMSG with a relatively large step size. It progresses faster than ADAM especially in testing, and finally achieves slightly higher test accuracy than SGD. SWNTS also converges faster than ADAM, and has better generalization than SGD. SWNTS1 denotes SWNTS with the step size of NAMSG1. It achieves the mean best generalization accuracy of , which is higher than the baseline of [He et al., 2016] obtained through data augmentation. The generalization performance of NAMSB is very close to SWNTS1.
The experiments show that in the machine learning problems tested (see supplementary materials for more results), NAMSG converges faster than other popular adaptive methods, such as ADAM, NADAM, and AMSGRAD. Besides, in grid search we observe that NAMSG is faster than AMSGRAD for almost all the hyperparameter settings, and faster than ADAM and NADAM for most of the settings. The acceleration is achieved with low computational overhead and almost no additional memory. Even if the generalization gap exists in some models, it can be fixed by simple strategies. For example, we can use a coarse grid search instead of a fine one, and select a relatively large step size. Switching to SGD also achieves good generalization.
6 Conclusions and discussions
We present the NAMSG method, which computes the gradients at configurable remote observation points, and scales the update vector elementwise by a nonincreasing preconditioner. It is efficient in computation and memory, and is straightforward to implement. A datadependent regret bound is proposed to guarantee the convergence in the convex setting. Numerical experiments demonstrate that NAMSG converges faster than ADAM, NADAM, and AMSGRAD, for the tested problems. The configurable remote gradient observations can also be applied to expedite convergence in cases where other adaptive methods or SGD are more suitable than AMSGRAD. The analysis of the optimizing process as a dynamic system may be useful for studies on hyperparameter selections.
References
 Amodei et al. [2016] D. Amodei, S. Ananthanarayanan, R. Anubhai, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 173–182, New York, New York, USA, 2016. Morgan Kaufmann.
 Bottou [2010] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.
 Byrd et al. [2016] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
 Chen et al. [2015] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015. URL http://arxiv.org/abs/1512.01274.
 Collobert et al. [2011] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

 Dozat [2016] Timothy Dozat. Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations, 2016.
 Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7):257–269, 2011.
 Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. CoRR, abs/1503.02101, 2015. URL http://arxiv.org/abs/1503.02101.

 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
 Hinton et al. [2012] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Keskar and Socher [2017] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. CoRR, abs/1712.07628, 2017. URL http://arxiv.org/abs/1712.07628.
 Kidambi et al. [2018] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJTutzbA.
 Kingma and Ba [2015] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In 2015 International Conference on Learning Representations (ICLR 2015), pages 1–11, San Diego, CA, 2015.
 Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Luo et al. [2019] Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.
 Martens [2010] J. Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning (ICML 2010), pages 735–742, 2010.
 Nair and Hinton [2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
 Nesterov [1983] Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
 Nocedal [1980] J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
 Polyak [1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):791–803, 1964.
 Reddi et al. [2018] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7fRZ.
 Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
 Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
 Tieleman and Hinton [2012] T. Tieleman and G. Hinton. RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Wilson et al. [2017] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, pages 4148–4158. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7003themarginalvalueofadaptivegradientmethodsinmachinelearning.pdf.
 Zeiler [2012] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701.