1 Introduction
Deep Neural Networks (DNNs) have drastically advanced the state-of-the-art performance in many computer science applications, including computer vision (Krizhevsky et al., 2012; He et al., 2016; Ren et al., 2015), natural language processing (Mikolov et al., 2013; Bahdanau et al., 2014; Gehring et al., 2017) and speech recognition (Sak et al., 2014; Sercu et al., 2016). Yet, in the face of such significant developments, the age-old (accelerated) stochastic gradient descent (SGD) algorithm remains one of the most popular methods, if not the most popular, for training DNNs (Sutskever et al., 2013; Goodfellow et al., 2016; Wilson et al., 2017).
Adaptive methods (Duchi et al., 2011; Zeiler, 2012; Hinton et al., 2012; Kingma and Ba, 2014; Ma and Yarats, 2018) sought to simplify the training process while providing similar performance. However, while they are often used by practitioners, there are cases where their use leads to a performance gap (Wilson et al., 2017; Shah et al., 2018). At the same time, much of the state-of-the-art performance on highly contested benchmarks, such as the image classification dataset ImageNet, has been produced with accelerated SGD (Krizhevsky et al., 2012; He et al., 2016; Xie et al., 2017; Zagoruyko and Komodakis, 2016; Huang et al., 2017; Ren et al., 2015; Howard et al., 2017).
Nevertheless, a key factor in any algorithmic success still lies in hyperparameter tuning. For example, the works cited above obtain such performance with a well-tuned SGD with momentum and a learning rate decay schedule, or with careful hyperparameter tuning in adaptive methods. Slight changes in learning rate, learning rate decay, momentum, and weight decay (amongst others) can drastically alter performance. Hyperparameter tuning is arguably one of the most time-consuming parts of training DNNs, and researchers often resort to a costly grid search. Thus, finding new and simple hyperparameter tuning routines that boost the performance of state-of-the-art algorithms is of great importance and one of the most pressing problems in machine learning.
The focus of this work is on the momentum parameter and how a simple technique can boost the performance of training methods. Momentum helps speed up learning in directions of low curvature, without becoming unstable in directions of high curvature. For minimizing an objective function $\mathcal{L}(\cdot)$, the simplest and most common momentum method, classical momentum (CM) (Polyak, 1964), is given by the following recursion for the variable vector $\theta_t \in \mathbb{R}^p$:
$$\theta_{t+1} = \theta_t + \eta v_t, \qquad v_t = \beta v_{t-1} - g_t.$$
The coefficient $\beta$, traditionally a constant selected in $[0, 1]$, controls how quickly the momentum decays; $g_t$ represents a stochastic gradient, usually with $\mathbb{E}[g_t] = \nabla \mathcal{L}(\theta_t)$; and $\eta > 0$ is the step size.
But how do we select $\beta$? The most prominent choice among practitioners is $\beta = 0.9$. This is supported by recent works that prescribe it (Chen et al., 2016; Kingma and Ba, 2014; Hinton et al., 2012; Reddi et al., 2019), and by the fact that most common software packages, such as PyTorch (Paszke et al., 2017), declare $\beta = 0.9$ as the default value in their optimizer implementations. However, there is no indication that this choice is universally well-behaved.
There are papers that attempt to tune the momentum parameter. Under an asynchronous distributed setting, Mitliagkas et al. (2016) observe that running SGD asynchronously is similar to adding a momentum-like term to SGD; they also provide experimental evidence that naively setting $\beta = 0.9$ would result in a momentum “overdose”, leading to suboptimal performance. As another example, YellowFin (Zhang and Mitliagkas, 2017) is a learning rate and momentum adaptive method for both the synchronous and asynchronous settings, motivated by a quadratic model analysis and some robustness insights. The main message of that work is that, like the step size $\eta$, momentum acceleration needs to be carefully selected based on properties of the objective, the data, and the underlying computational resources. Finally, moving from classical DNN settings towards generative adversarial networks (GANs), the proposed momentum values tend to decrease from $\beta = 0.9$ (Mirza and Osindero, 2014; Radford et al., 2015; Arjovsky et al., 2017), even taking negative values (Gidel et al., 2018).
In this paper, we introduce a novel momentum decay rule which significantly surpasses the performance of both Adam and CM (as they are currently used), as well as other state-of-the-art adaptive learning rate and adaptive momentum methods, across a variety of datasets and networks. In particular, our findings can be summarized as follows:

We propose a new momentum decay rule, motivated by decaying the total contribution of a gradient to all future updates, with limited overhead and additional computation.

Using the momentum decay rule with Adam, we observe large performance gains relative to vanilla Adam: the network continues to learn long after Adam begins to plateau. We suggest that the momentum decay rule be used as the default for this method.

We observe comparable performance for CM between momentum decay and learning rate decay, a surprising finding given the unparalleled effectiveness of learning rate decay schedules.
Experiments are provided on various datasets, including MNIST, CIFAR10, CIFAR100, STL10, and Penn Treebank (PTB), and on various networks, including Convolutional Neural Networks (CNN) with Residual architecture (ResNet) (He et al., 2016), Wide Residual architecture (Wide ResNet) (Zagoruyko and Komodakis, 2016), and Non-Residual architecture (VGG16) (Simonyan and Zisserman, 2014); Recurrent Neural Networks (RNN) with Long Short-Term Memory architecture (LSTM) (Hochreiter and Schmidhuber, 1997); Variational AutoEncoders (VAE) (Kingma and Welling, 2015); and the recent Noise Conditional Score Network (NCSN) (Song and Ermon, 2019).
2 Preliminaries
Plain stochastic gradient descent motions. Let $\theta_t \in \mathbb{R}^p$ be the parameters of the network at time step $t$, where $\eta \in \mathbb{R}$ is the learning rate/step size, and $g_t$ is the stochastic gradient w.r.t. $\theta_t$ for empirical loss $\mathcal{L}(\cdot)$, such that $\mathbb{E}[g_t] = \nabla \mathcal{L}(\theta_t)$. Then, plain stochastic gradient descent (SGD) uses the recursion $\theta_{t+1} = \theta_t - \eta g_t$. Here, the step size could also be time dependent, $\eta_t$, but practice shows that decreasing the value of $\eta$ at regular or predefined intervals works favorably compared to decreasing the value of $\eta$ at every iteration.
CM is parameterized by $\beta$, the momentum coefficient, and follows the recursion:
$$\theta_{t+1} = \theta_t + \eta v_t, \qquad v_t = \beta v_{t-1} - g_t,$$
where $v_t$ accumulates momentum. Observe that for $\beta = 0$, the above recursion is equivalent to SGD. Common values for $\beta$ are closer to one, with $\beta = 0.9$ the most used value (Ruder, 2016).
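The two recursions can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the authors' code: it assumes the CM form $v_t = \beta v_{t-1} - g_t$, $\theta_{t+1} = \theta_t + \eta v_t$, and the function names, the toy quadratic loss, and the use of exact gradients in place of stochastic ones are illustrative choices.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD: theta_{t+1} = theta_t - lr * g_t."""
    return [p - lr * g for p, g in zip(theta, grad)]


def cm_step(theta, velocity, grad, lr, beta):
    """Classical momentum (assumed form):
    v_t = beta * v_{t-1} - g_t,  theta_{t+1} = theta_t + lr * v_t."""
    velocity = [beta * v - g for v, g in zip(velocity, grad)]
    theta = [p + lr * v for p, v in zip(theta, velocity)]
    return theta, velocity


def quadratic_grad(theta):
    """Exact gradient of the toy loss 0.5 * ||theta||^2 (stands in for g_t)."""
    return list(theta)


theta_sgd = [1.0, -2.0]
theta_cm, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(3):
    theta_sgd = sgd_step(theta_sgd, quadratic_grad(theta_sgd), lr=0.1)
    theta_cm, v = cm_step(theta_cm, v, quadratic_grad(theta_cm), lr=0.1, beta=0.0)

# With beta = 0 the momentum recursion reduces to plain SGD,
# so both parameter vectors coincide.
print(theta_sgd, theta_cm)
```

Setting `beta` to a value such as 0.9 in `cm_step` makes each past gradient keep influencing later updates, which is the effect the momentum coefficient controls.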
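Adaptive gradient descent motions. These algorithms utilize current and past gradient information to design preconditioning matrices that better approximate the local curvature of $\mathcal{L}(\cdot)$. Beginning with AdaGrad (Duchi et al., 2011), the SGD recursion, per coordinate $i$ of $\theta$, becomes:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \varepsilon}} \cdot g_{t,i},$$
where $G_t \in \mathbb{R}^{p \times p}$ is usually a diagonal preconditioning matrix formed as a summation of squares of past gradients, and $\varepsilon > 0$ is a small constant.
RMSprop (Hinton et al., 2012) substitutes the ever-accumulating matrix $G_t$ with a root mean squared operation. Denoting the average of squared gradients as $E^{g \circ g}_t$, per iteration we compute $E^{g \circ g}_{t+1} = \beta_2 E^{g \circ g}_t + (1 - \beta_2)(g_t \circ g_t)$, where $\beta_2$ was first proposed as $0.9$. Here, $\circ$ denotes per-coordinate multiplication. Then, RMSprop updates as follows (a momentum term can optionally be added):
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{E^{g \circ g}_{t+1,i} + \varepsilon}} \cdot g_{t,i}.$$
Finally, Adam (Kingma and Ba, 2014), in addition, keeps an exponentially decaying average of past gradients, $E^{g}_{t+1} = \beta_1 E^{g}_t + (1 - \beta_1) g_t$, leading to the recursion:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{E^{g \circ g}_{t+1,i} + \varepsilon}} \cdot E^{g}_{t+1,i},$$
where usually $\beta_1 = 0.9$ and $\beta_2 = 0.999$. (For clarity, we skip the bias correction step in this description of Adam; see Kingma and Ba (2014).) Observe that, when $\beta_1 = 0$ and no bias correction is applied, Adam reduces to the RMSprop recursion.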
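The relation between Adam and RMSprop stated above can be checked numerically. The sketch below is illustrative only: it omits bias correction, as the description does, and it assumes that the small constant is added inside the square root (other write-ups add it outside). Function and variable names are hypothetical.

```python
import math


def rmsprop_step(theta, sq_avg, grad, lr, beta2=0.9, eps=1e-8):
    """One RMSprop step without the optional momentum term."""
    sq_avg = [beta2 * s + (1.0 - beta2) * g * g for s, g in zip(sq_avg, grad)]
    theta = [p - lr * g / math.sqrt(s + eps)
             for p, g, s in zip(theta, grad, sq_avg)]
    return theta, sq_avg


def adam_step(theta, avg, sq_avg, grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction omitted (assumption of this sketch)."""
    avg = [beta1 * m + (1.0 - beta1) * g for m, g in zip(avg, grad)]
    sq_avg = [beta2 * s + (1.0 - beta2) * g * g for s, g in zip(sq_avg, grad)]
    theta = [p - lr * m / math.sqrt(s + eps)
             for p, m, s in zip(theta, avg, sq_avg)]
    return theta, avg, sq_avg


grad = [0.5, -1.5]
theta_r, _ = rmsprop_step([1.0, 2.0], [0.0, 0.0], grad, lr=0.01, beta2=0.9)
theta_a, _, _ = adam_step([1.0, 2.0], [0.0, 0.0], [0.0, 0.0], grad,
                          lr=0.01, beta1=0.0, beta2=0.9)

# With beta1 = 0 the gradient average equals the current gradient,
# so the Adam step matches the RMSprop step.
print(theta_r, theta_a)
```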
3 Demon: Decaying momentum algorithm
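Motivation and interpretation. Demon is motivated by learning rate rules: by decaying the momentum parameter, we decay the total contribution of a gradient to all future updates. Similar reasoning applies to learning rate decay routines; however, our goal here is to present a concrete and easy-to-use momentum decay procedure, which can be used with or without learning rate routines, as we show in the experimental section. The key component is the momentum decay schedule:
$$\beta_t = \beta_{\text{init}} \cdot \frac{1 - \frac{t}{T}}{(1 - \beta_{\text{init}}) + \beta_{\text{init}} \left(1 - \frac{t}{T}\right)}. \tag{1}$$
The interpretation of this rule comes from the following argument. Assume a fixed momentum parameter $\beta_t \equiv \beta$; e.g., $\beta = 0.9$, as the literature dictates. For our discussion, we will use the accelerated SGD recursion. We know that $v_0 = 0$ and $v_t = \beta v_{t-1} - g_t$. Then, the main recursion can be unrolled into:
$$\theta_{t+1} = \theta_t - \eta g_t - \eta \beta g_{t-1} - \eta \beta^2 g_{t-2} + \eta \beta^3 v_{t-3} = \dots = \theta_t - \eta g_t - \eta \sum_{i=1}^{t-1} \beta^i g_{t-i}.$$
Interpreting the above recursion, a particular gradient term $g_t$ contributes a total of $\eta \sum_i \beta^i$ of its “energy” to all future gradient updates. Moreover, for an asymptotically large number of iterations, $\beta$ contributes through up to $t - 1$ terms. Then, $\sum_{i=1}^{\infty} \beta^i = \beta \sum_{i=0}^{\infty} \beta^i = \beta / (1 - \beta)$. Thus, in our quest for a simple linear momentum decay schedule, it is natural to consider a scheme where the cumulative momentum is decayed to $0$. Let $\beta_{\text{init}}$ be the initial $\beta$; then, at the current step $t$ out of $T$ total steps, we design the decay routine such that $\beta_t / (1 - \beta_t) = (1 - t/T)\, \beta_{\text{init}} / (1 - \beta_{\text{init}})$. Solving for $\beta_t$ leads to equation 1.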
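The schedule can be sketched as a small function. The closed form used here follows from requiring the cumulative momentum $\beta_t / (1 - \beta_t)$ to decay linearly from $\beta_{\text{init}} / (1 - \beta_{\text{init}})$ to zero over $T$ steps; the function name `demon_beta` and the choice $T = 100$ are illustrative assumptions, not part of the original text.

```python
def demon_beta(step, total_steps, beta_init=0.9):
    """Momentum at `step`, chosen so that beta_t / (1 - beta_t) equals
    (1 - step / total_steps) * beta_init / (1 - beta_init)."""
    frac = 1.0 - step / total_steps
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


T = 100
betas = [demon_beta(t, T) for t in range(T + 1)]

# The cumulative momentum beta / (1 - beta) should fall linearly from 9 to 0,
# while beta itself stays close to beta_init for most of training.
cumulative = [b / (1.0 - b) for b in betas]
print(betas[0], betas[T // 2], betas[T])
print(cumulative[0], cumulative[T // 2], cumulative[T])
```

Note that the momentum value itself does not decay linearly: halfway through training it is still roughly 0.82 for an initial value of 0.9, and most of the drop happens in the final steps.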
Connection to previous algorithms. Demon introduces an implicit discount factor. The main recursions of the algorithm are the same as those of standard algorithms in machine learning. E.g., for a constant $\beta_t = \beta$ we obtain SGD with momentum, and for $\beta_t = 0$ we obtain plain SGD in Algorithm 1; in Algorithm 2, for a constant $\beta_t = \beta_1$ with a slight adjustment of the learning rate we obtain Adam, while for $\beta_t = 0$ we obtain a non-accumulative AdaGrad algorithm. We choose to apply Demon to a slightly adjusted Adam, instead of vanilla Adam, to isolate the effect of the momentum parameter, since in vanilla Adam the momentum parameter also adjusts the magnitude of the current gradient.
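Algorithms 1 and 2 are not reproduced in this section, so the following is only a hedged sketch of how the decay schedule might plug into the CM and Adam recursions of Section 2. Several details are assumptions: the CM form $v_t = \beta_t v_{t-1} - g_t$, the adjusted Adam average $m_t = g_t + \beta_t m_{t-1}$ (dropping the $(1 - \beta)$ factor so that momentum does not rescale the current gradient), the omission of bias correction, and all function names, including the user-supplied `grad_fn`.

```python
import math


def demon_beta(step, total_steps, beta_init):
    """Decayed momentum with beta_t / (1 - beta_t) linear in the step."""
    frac = 1.0 - step / total_steps
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


def demon_cm(grad_fn, theta, lr, total_steps, beta_init=0.9):
    """Classical momentum with a decaying coefficient (assumed form)."""
    v = [0.0] * len(theta)
    for t in range(total_steps):
        beta_t = demon_beta(t, total_steps, beta_init)
        g = grad_fn(theta)
        v = [beta_t * vi - gi for vi, gi in zip(v, g)]
        theta = [p + lr * vi for p, vi in zip(theta, v)]
    return theta


def demon_adam(grad_fn, theta, lr, total_steps,
               beta_init=0.9, beta2=0.999, eps=1e-8):
    """Adjusted Adam (assumption): m_t = g_t + beta_t * m_{t-1}, no bias correction."""
    m = [0.0] * len(theta)
    s = [0.0] * len(theta)
    for t in range(total_steps):
        beta_t = demon_beta(t, total_steps, beta_init)
        g = grad_fn(theta)
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        m = [gi + beta_t * mi for gi, mi in zip(g, m)]
        theta = [p - lr * mi / math.sqrt(si + eps)
                 for p, mi, si in zip(theta, m, s)]
    return theta


def quadratic_grad(theta):
    """Gradient of the toy loss 0.5 * ||theta||^2 (hypothetical test problem)."""
    return list(theta)


# With beta_init = 0 the schedule is identically zero and Demon CM is plain SGD.
print(demon_cm(quadratic_grad, [1.0, -2.0], lr=0.1, total_steps=5, beta_init=0.0))
```

In this sketch, a zero initial momentum in `demon_adam` leaves only the exponentially averaged second-moment scaling, which corresponds to the non-accumulative AdaGrad case mentioned above.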
Efficiency. Demon requires only limited extra overhead and computation in comparison to its vanilla counterparts, namely the computation of $\beta_t$.
Practical suggestions. For settings in which $\beta_{\text{init}}$ is typically large, such as image classification, we advocate decaying momentum from $\beta_{\text{init}}$ at $t = 0$ to $0$ at $t = T$ as a general rule. We also observe and report improved performance by delaying momentum decay until later epochs. In many cases, performance can be further improved by decaying to a small negative value, such as $-0.3$.
4 Related work
There are numerous techniques for automatic hyperparameter tuning. The most widely used are learning rate adaptive methods, starting with AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Hinton et al., 2012), and Adam (Kingma and Ba, 2014). Adam, the most popular, introduced a momentum term, which is combined with the current gradient before multiplying by an adaptive learning rate. Interest in closing the generalization gap between adaptive methods and CM led to AdamW (Loshchilov and Hutter, 2017), which fixes the weight decay of Adam, and Padam (Chen and Gu, 2018), which lowers the exponent of the second moment.
Asynchronous methods are commonly used in deep learning, and Mitliagkas et al. (2016) show that running SGD asynchronously is similar to adding a momentum-like term to SGD, without assuming convexity of the objective function. They demonstrate this natural connection empirically on CNNs. This implies that the momentum parameter needs to be tuned according to the level of asynchrony. YellowFin (Zhang and Mitliagkas, 2017) is a learning rate and momentum adaptive method for both the synchronous and asynchronous settings, motivated by a quadratic model analysis and robustness insights. In the non-convex setting, STORM (Cutkosky and Orabona, 2019) uses a variant of momentum for variance reduction.
There is substantial research, both empirical and theoretical, into the convergence of momentum methods (Wibisono and Wilson, 2015; Wibisono et al., 2016; Wilson et al., 2016; Kidambi et al., 2018). In addition, Sutskever et al. (2013) explored momentum schedules, including schedules that increase momentum during training, inspired by Nesterov’s routines for convex optimization. There is some work on reducing oscillations during training by adapting the momentum (O’donoghue and Candes, 2015). There is also work on adapting momentum in well-conditioned convex problems, as opposed to setting it to zero (Srinivasan et al., 2018). Another approach in this area is to keep several momentum vectors with different values of $\beta$ and combine them (Lucas et al., 2018). We are aware of the theoretical work of Yuan et al. (2016), which proves that, under certain conditions, momentum SGD is equivalent to SGD with a rescaled learning rate; however, our experiments in the deep learning setting show slightly different behavior, and understanding why is an exciting direction of research.
Smaller values of $\beta$ have gradually been employed for Generative Adversarial Networks (GANs), and recent developments in game dynamics (Gidel et al., 2018) show that a negative momentum is helpful for GANs.
5 Experiments
| Experiment short name | Model | Dataset | Optimizer |
|---|---|---|---|
| RN18CIFAR10DEMONCM | ResNet18 | CIFAR10 | Demon CM |
| RN18CIFAR10DEMONAdam | ResNet18 | CIFAR10 | Demon Adam |
| VGG16CIFAR100DEMONCM | VGG16 | CIFAR100 | Demon CM |
| VGG16CIFAR100DEMONAdam | VGG16 | CIFAR100 | Demon Adam |
| WRNSTL10DEMONCM | Wide ResNet 16-8 | STL10 | Demon CM |
| WRNSTL10DEMONAdam | Wide ResNet 16-8 | STL10 | Demon Adam |
| LSTMPTBDEMONCM | LSTM RNN | Penn TreeBank | Demon CM |
| LSTMPTBDEMONAdam | LSTM RNN | Penn TreeBank | Demon Adam |
| VAEMNISTDEMONCM | VAE | MNIST | Demon CM |
| VAEMNISTDEMONAdam | VAE | MNIST | Demon Adam |
| NCSNCIFAR10DEMONAdam | NCSN | CIFAR10 | Demon Adam |
We separate experiments into those with adaptive learning rate methods and those with adaptive momentum methods. All settings, with exact hyperparameters, are briefly summarized in Table 1 and comprehensively detailed in Appendix A. We report improved performance by delaying the application of Demon where applicable, and report performance across different numbers of total epochs to demonstrate effectiveness regardless of the training budget. Note that the predefined number of epochs for which we run each experiment affects the proposed decaying momentum routine, by the definition of $\beta_t$.
5.1 Adaptive methods
First, we apply Demon Adam (Algorithm 2) to a variety of models and tasks. We select vanilla Adam as the baseline algorithm and include the more recent state-of-the-art adaptive learning rate methods Quasi-Hyperbolic Adam (QHAdam) (Ma and Yarats, 2018) and AMSGrad (Reddi et al., 2019) in our comparison. See Appendix A.2.1 for details. We tune all learning rates in roughly multiples of 3 and try to keep all other parameters close to those recommended in the original literature. For Demon Adam, we leave $\beta_{\text{init}}$ fixed and decay from $\beta_{\text{init}}$ to $0$ in all experiments.
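A learning rate grid spaced in roughly multiples of 3 can be generated as follows. This is only an illustration of the spacing; the function name `lr_grid` and the default range from 1e-4 to 1e-1 are hypothetical and are not the ranges used in the experiments.

```python
def lr_grid(low_exp=-4, high_exp=-1):
    """Candidate learning rates 1e-4, 3e-4, 1e-3, ..., spaced by a factor of about 3."""
    grid = []
    for exp in range(low_exp, high_exp + 1):
        grid.append(10.0 ** exp)
        if exp < high_exp:
            # Intermediate point, so neighbours differ by 3x or about 3.3x.
            grid.append(3.0 * 10.0 ** exp)
    return grid


grid = lr_grid()
print(grid)
```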
Residual Neural Network (RN18CIFAR10DEMONAdam). We train a ResNet18 (He et al., 2016) model on the CIFAR10 dataset. With Demon Adam, we achieve the generalization error reported in the literature (He et al., 2016) for this model, which was attained using CM and a curated learning rate decay schedule, whilst all other methods are non-competitive. Refer to Table 2 for exact results.
In Figure 2 (top row, two leftmost plots), Demon Adam continues to improve in terms of both loss and accuracy after other methods have plateaued. Across 5 seeds, Demon Adam outperforms all other methods by a large 2%-5% generalization error margin, for both small and large numbers of epochs.
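| Method | 30 epochs | 75 epochs | 150 epochs | 300 epochs |
|---|---|---|---|---|
| Adam | 16.58 ± .18 | 13.63 ± .22 | 11.90 ± .06 | 11.94 ± .06 |
| AMSGrad | 16.98 ± .36 | 13.43 ± .14 | 11.83 ± .12 | 10.48 ± .12 |
| QHAdam | 16.41 ± .38 | 15.55 ± .25 | 13.78 ± .08 | 13.36 ± .11 |
| Demon Adam | 11.75 ± .15 | 9.69 ± .10 | 8.83 ± .08 | 8.44 ± .05 |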
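The size of the reported margin can be recomputed from the mean errors in the table above. The sketch below uses only those tabulated means; the dictionary layout and the helper name `margin_over_best_baseline` are illustrative.

```python
# Mean generalization error (%) for ResNet18 on CIFAR10, keyed by epoch budget.
errors = {
    "Adam":       {30: 16.58, 75: 13.63, 150: 11.90, 300: 11.94},
    "AMSGrad":    {30: 16.98, 75: 13.43, 150: 11.83, 300: 10.48},
    "QHAdam":     {30: 16.41, 75: 15.55, 150: 13.78, 300: 13.36},
    "Demon Adam": {30: 11.75, 75: 9.69, 150: 8.83, 300: 8.44},
}


def margin_over_best_baseline(table, method, epochs):
    """Gap (in percentage points) between `method` and the strongest other method."""
    best_other = min(row[epochs] for name, row in table.items() if name != method)
    return best_other - table[method][epochs]


margins = {
    epochs: round(margin_over_best_baseline(errors, "Demon Adam", epochs), 2)
    for epochs in (30, 75, 150, 300)
}
print(margins)
```

Measured against the strongest baseline at each budget, the gap ranges from about 2.0 points at 300 epochs to about 4.7 points at 30 epochs.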
Non-Residual Neural Network (VGG16CIFAR100DEMONAdam). For the CIFAR100 dataset, we train an adjusted VGG16 model (Simonyan and Zisserman, 2014). As in the previous setting, Demon Adam continues to improve after other methods appear to begin to plateau. We note that this behavior results in a 13% decrease in generalization error relative to typically reported results with the same model and task (Sankaranarayanan et al., 2018), which are attained using CM and a curated learning rate decay schedule.
Across 5 seeds, Demon Adam improves on all other methods by a 3%-6% generalization error margin, for both small and large numbers of epochs. Refer to Figure 2 (top row, rightmost plot) and Table 3 for more details.
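| Method | VGG16, 75 epochs | VGG16, 150 epochs | VGG16, 300 epochs | Wide Residual 16-8, 50 epochs | Wide Residual 16-8, 100 epochs | Wide Residual 16-8, 200 epochs |
|---|---|---|---|---|---|---|
| Adam | 37.98 ± .20 | 33.62 ± .11 | 31.09 ± .09 | 23.35 ± .20 | 19.63 ± .26 | 18.65 ± .07 |
| AMSGrad | 40.67 ± .65 | 34.46 ± .21 | 31.62 ± .12 | 21.73 ± .25 | 19.35 ± .20 | 18.21 ± .18 |
| QHAdam | 36.53 ± .20 | 32.96 ± .11 | 30.97 ± .10 | 21.25 ± .22 | 19.81 ± .18 | 18.52 ± .25 |
| Demon Adam | 32.40 ± .19 | 28.84 ± .18 | 27.11 ± .19 | 19.42 ± .10 | 18.36 ± .11 | 17.62 ± .12 |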
Wide Residual Neural Network (WRNSTL10DEMONAdam). The STL10 dataset presents a different challenge, with significantly fewer images than the CIFAR datasets but at higher resolution. We train a Wide Residual 16-8 model (Zagoruyko and Komodakis, 2016) for this task. In this setting, we again observe Demon Adam significantly outperforming other methods in the later stages of training.
Across 5 seeds, Demon Adam outperforms all other methods by a 0.5%-2% generalization error margin, for both small and large numbers of epochs. Refer to Figure 2 (bottom row, leftmost plot) and Table 3 for more details.
LSTM (LSTMPTBDEMONAdam). Language modeling can have sharp gradient distributions, for example in the case of rare words. We use an LSTM (Hochreiter and Schmidhuber, 1997) model for this task. We observe overfitting for all adaptive methods.
As above, across 5 seeds, Demon Adam outperforms all other methods by a 6-14 point generalization perplexity margin, for both small and large numbers of epochs. Refer to Figure 2 (bottom row, middle plot) and Table 4 for more details.
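| Method | LSTM, 25 epochs | LSTM, 39 epochs | VAE, 50 epochs | VAE, 100 epochs | VAE, 200 epochs | NCSN, 512 epochs |
|---|---|---|---|---|---|---|
| Adam | 115.54 ± .64 | 115.02 ± .52 | 136.28 ± .18 | 134.64 ± .14 | 134.66 ± .17 | 8.15 ± .20 |
| AMSGrad | 108.07 ± .19 | 107.87 ± .25 | 137.89 ± .12 | 135.69 ± .03 | 134.75 ± .18 | |
| QHAdam | 112.52 ± .23 | 112.45 ± .39 | 136.69 ± .17 | 134.84 ± .08 | 134.12 ± .12 | |
| Demon Adam | 101.57 ± .32 | 101.44 ± .47 | 134.46 ± .17 | 134.12 ± .08 | 133.87 ± .21 | 8.07 ± .08 |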
Variational AutoEncoder (VAEMNISTDEMONAdam). Generative models are a branch of unsupervised learning that try to learn the data distribution. VAEs (Kingma and Welling, 2015) pair a generator network with a second neural network, a recognition model that performs approximate inference, and can be trained with backpropagation. We train VAEs on the MNIST dataset.
Across 5 seeds, Demon Adam outperforms all other methods, particularly for smaller numbers of epochs. Refer to Figure 2 (bottom row, rightmost plot) and Table 4 for more details.
Noise Conditional Score Network (NCSNCIFAR10DEMONAdam). NCSN (Song and Ermon, 2019) is a recent generative network achieving a state-of-the-art inception score on CIFAR10. NCSN estimates the gradients of the data distribution with score matching. Samples are then produced via Langevin dynamics using those gradients. We train an NCSN on the CIFAR10 dataset and, using the official implementation, were unable to reproduce the score reported in the literature. NCSN trained with Adam achieves a superior inception score in Table 4; however, the images produced in Figure 1 exhibit a noticeably unnatural green tint compared to those produced by Demon Adam.
[Figure caption, partially recovered: ... for 200 epochs. Dotted and solid lines represent training and generalization metrics, respectively. Shaded bands represent one standard deviation.]
5.2 Adaptive momentum methods
We apply Demon CM (Algorithm 1) to a variety of models and tasks. Since CM with learning rate decay is most often used to achieve the state-of-the-art results with the architectures and tasks in question, we include CM with learning rate decay as the target to beat. CM with learning rate decay is implemented with a decay on validation error plateau, where we hand-tune the number of epochs that defines a plateau. Recent adaptive momentum methods included in this section are Aggregated Momentum (AggMo) (Lucas et al., 2018) and Quasi-Hyperbolic Momentum (QHM) (Ma and Yarats, 2018). We exclude accelerated SGD (Jain et al., 2017) due to difficulties in tuning. See Appendix A.2.2 for details. As in the previous section, we tune all learning rates in roughly multiples of 3 and try to keep all other parameters close to those recommended in the original literature. For Demon CM, we leave $\beta_{\text{init}}$ fixed for most experiments and generally decay from $\beta_{\text{init}}$ to $0$.
Residual Neural Network (RN18CIFAR10DEMONCM). We train a ResNet18 model on the CIFAR10 dataset. With Demon CM, we achieve better generalization error than CM with learning rate decay, the optimizer used to produce state-of-the-art results with the ResNet architecture. It is very surprising that decaying momentum can produce even better performance than learning rate decay.
Across 5 seeds, Demon CM outperforms all other adaptive momentum methods by a large 3%-8% validation error margin, for both small and large numbers of epochs, and is competitive with or better than CM with learning rate decay. In Figure 3 (top row, two leftmost plots), Demon CM is observed to continue learning after other adaptive momentum methods appear to begin to plateau.
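| Method | 30 epochs | 75 epochs | 150 epochs | 300 epochs |
|---|---|---|---|---|
| CM learning rate decay | 11.29 ± .35 | 9.05 ± .07 | 8.26 ± .07 | 7.97 ± .14 |
| AggMo | 18.85 ± .27 | 13.02 ± .23 | 11.95 ± .15 | 10.94 ± .12 |
| QHM | 14.65 ± .24 | 12.66 ± .19 | 11.27 ± .13 | 10.42 ± .05 |
| Demon CM | 10.89 ± .12 | 8.97 ± .16 | 8.39 ± .10 | 7.58 ± .04 |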
Non-Residual Neural Network (VGG16CIFAR100DEMONCM). For the CIFAR100 dataset, we train an adjusted VGG16 model. In Figure 3 (top row, rightmost plot), we observe that Demon CM initially learns slowly in both loss and error but, as in the previous setting, continues to learn after other methods begin to plateau, resulting in superior final generalization error.
Across 5 seeds, Demon CM improves on all other methods by a 1%-8% generalization error margin. Refer to Table 6 for more details.
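| Method | VGG16, 75 epochs | VGG16, 150 epochs | VGG16, 300 epochs | Wide Residual 16-8, 75 epochs | Wide Residual 16-8, 150 epochs | Wide Residual 16-8, 300 epochs |
|---|---|---|---|---|---|---|
| CM learning rate decay | 35.29 ± .59 | 30.65 ± .31 | 29.74 ± .43 | 21.05 ± .27 | 17.83 ± .39 | 15.16 ± .36 |
| AggMo | 42.85 ± .89 | 34.25 ± .24 | 32.32 ± .18 | 22.70 ± .11 | 20.06 ± .31 | 17.90 ± .13 |
| QHM | 42.14 ± .79 | 33.87 ± .26 | 32.45 ± .13 | 22.86 ± .15 | 19.40 ± .23 | 17.79 ± .08 |
| Demon CM | 34.35 ± .44 | 30.59 ± .26 | 28.99 ± .16 | 19.45 ± .20 | 15.98 ± .40 | 13.67 ± .13 |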
Wide Residual Neural Network (WRNSTL10DEMONCM). We train a Wide Residual 16-8 model on the STL10 dataset. In Figure 3 (bottom row, leftmost plot), training in both loss and error slows down quickly for the other adaptive momentum methods, leaving a large gap to CM with learning rate decay. Demon CM continues to improve and eventually catches up to CM with learning rate decay.
Across 5 seeds, Demon CM outperforms all other methods by a 1.5%-2% generalization error margin, for both small and large numbers of epochs. Refer to Table 6 for more details.
LSTM (LSTMPTBDEMONCM). We train an RNN with LSTM architecture for the PTB language modeling task. Across 5 seeds, Demon CM slightly outperforms other adaptive momentum methods in generalization perplexity, and is competitive with CM with learning rate decay. Refer to Figure 3 (bottom row, middle plot) and Table 7 for more details.
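| Method | LSTM, 25 epochs | LSTM, 39 epochs | VAE, 50 epochs | VAE, 100 epochs | VAE, 200 epochs |
|---|---|---|---|---|---|
| CM learning rate decay | 89.59 ± .07 | 87.57 ± .11 | 140.51 ± .73 | 139.54 ± .34 | 137.33 ± .49 |
| AggMo | 89.09 ± .16 | 89.07 ± .15 | 139.69 ± .17 | 139.07 ± .26 | 137.64 ± .20 |
| QHM | 94.47 ± .19 | 94.44 ± .13 | 145.84 ± .39 | 140.92 ± .19 | 137.64 ± .20 |
| Demon CM | 88.33 ± .16 | 88.32 ± .12 | 139.32 ± .23 | 137.51 ± .29 | 135.95 ± .21 |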
Variational AutoEncoder (VAEMNISTDEMONCM). We train the generative model VAE on the MNIST dataset. Across 5 seeds, Demon CM outperforms all other methods by a 2%-6% generalization error margin, for both small and large numbers of epochs. Refer to Figure 3 (bottom row, rightmost plot) and Table 7 for more details.
6 Conclusion
We show the effectiveness of the proposed momentum decay rule, Demon, across a number of datasets and architectures. Adam combined with Demon is empirically substantially superior to the popular vanilla Adam, as well as to other state-of-the-art adaptive learning rate algorithms, suggesting that it can serve as a drop-in replacement. Surprisingly, we also demonstrate that Demon CM is comparable to CM with learning rate decay. In cases where the budget is limited, Demon CM may be preferable. Demon is computationally cheap and easy to understand and use, and we hope it is useful in practice and as a subject of future research.
References
 Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1.
 Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
 Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981. Cited by: §1.
 Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763. Cited by: §4.
 Momentumbased variance reduction in nonconvex sgd. arXiv preprint arXiv:1905.10018. Cited by: §4.

 Incorporating nesterov momentum into adam. ICLR Workshop, (1):2013–2016. Cited by: §A.2.1.
 Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §1, §2, §4.
 Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1243–1252. Cited by: §1.
 Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740. Cited by: §1, §4.
 Deep learning. Vol. 1, MIT Press. Cited by: §1.

 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: 1st item, §1, §1, §1, §5.1.
 Neural networks for machine learning lecture 6a overview of minibatch gradient descent. Cited on 14, pp. 8. Cited by: §1, §1, §2, §4.
 Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: 4th item, §1, §5.1.
 Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
 Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
 Accelerating stochastic gradient descent for least squares regression. arXiv preprint arXiv:1704.08227. Cited by: §A.2.2, §5.2.
 On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–9. Cited by: §4.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2.1, §1, §1, §2, §4, footnote 1.
 Autoencoding variational bayes.. arXiv preprint arXiv:1312.6114. Cited by: 5th item, §1, §5.1.
 Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §1.
 Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization 26 (1), pp. 57–95. Cited by: §A.2.2.
 Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101. Cited by: §4.
 Aggregated momentum: stability through passive damping. arXiv preprint arXiv:1804.00325. Cited by: §A.2.2, §4, §5.2.
 Quasihyperbolic momentum and Adam for deep learning. arXiv preprint arXiv:1810.06801. Cited by: §A.2.1, §A.2.2, §1, §5.1, §5.2.
 Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
 Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
 Asynchrony begets momentum, with an application to deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 997–1004. Cited by: §1, §4.
 A method for solving the convex programming problem with convergence rate of (1/kˆ2). Soviet Mathematics Doklady 27 (2), pp. 372–376. Cited by: §A.2.2.
 Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics 15 (3), pp. 715–732. Cited by: §4.
 Automatic differentiation in pytorch. Cited by: §1.
 Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4 (5), pp. 1–17. Cited by: §1.
 Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1.
 On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237. Cited by: §A.2.1, §1, §5.1.
 Faster RCNN: towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1.
 An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §2.
 Long shortterm memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, Cited by: §1.
 Regularizing deep networks using efficient layerwise adversarial training. arXiv preprint arXiv:1705.07819. Cited by: §5.1.
 Very deep multilingual convolutional neural networks for LVCSR. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4955–4959. Cited by: §1.
 Minimum norm solutions do not always generalize well for overparameterized problems. arXiv preprint arXiv:1811.07055. Cited by: §1.
 Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: 2nd item, §1, §5.1.
 Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600. Cited by: 6th item, §1, §5.1.

 ADINE: an adaptive momentum method for stochastic gradient descent. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pp. 249–256. Cited by: §4.
 On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §1, §4.
 A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences 113 (47), pp. E7351–E7358. Cited by: §4.
 On accelerated methods in optimization. arXiv preprint arXiv:1509.03616. Cited by: §4.
 A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635. Cited by: §4.
 The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158. Cited by: §1, §1.
 Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §1.
 On the influence of momentum acceleration on online learning. Journal of Machine Learning Research 17 (192), pp. 1–66. Cited by: §4.
 Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: 3rd item, §1, §1, §5.1.
 ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §1, §4.
 Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471. Cited by: §1, §4.
Appendix A Experiments
We evaluated the momentum decay rule with Adam and CM on residual CNNs, non-residual CNNs, RNNs, and generative models. For CNNs, we used the image classification datasets CIFAR10, CIFAR100, and STL10. For RNNs, we used the language modeling dataset PTB. For generative modeling, we used the MNIST and CIFAR10 datasets. For each network-dataset pair other than NCSN, we evaluated Adam, QHAdam, AMSGrad, Demon Adam, AggMo, QHM, Demon CM, and CM with learning rate decay. For adaptive learning rate methods and adaptive momentum methods, we generally perform a grid search over the learning rate. For CM, we generally perform a grid search over the learning rate and initial momentum. For CM with learning rate decay, the learning rate is decayed by a factor of 0.1 after there is no improvement in validation loss for a hand-tuned number of epochs.
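The plateau-based learning rate decay used for the CM baseline can be sketched as a small scheduler. This is an illustration under assumptions, not the code used in the experiments: the class name `PlateauDecay` is hypothetical, and `patience` stands for the hand-tuned number of epochs, whose actual values are not stated here.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` consecutive
    epochs without a new best validation loss."""

    def __init__(self, lr, factor=0.1, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # hypothetical value; hand-tuned in practice
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


scheduler = PlateauDecay(lr=0.1, factor=0.1, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.8]]
print(history)
```

In this toy run the learning rate drops by a factor of 10 at the fourth epoch, after two epochs without improvement over the best loss of 0.9.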
A.1 Setup
We describe the six test problems in this paper.

CIFAR10 - ResNet18: CIFAR10 contains 60,000 32x32x3 images, split into a 50,000-image training set and a 10,000-image test set. There are 10 classes. ResNet18 (He et al., 2016) is an 18-layer CNN with skip connections for image classification. Trained with a batch size of 128.

CIFAR100 - VGG16: CIFAR100 is a fine-grained version of CIFAR10 and contains 60,000 32x32x3 images, split into a 50,000-image training set and a 10,000-image test set. There are 100 classes. VGG16 (Simonyan and Zisserman, 2014) is a 16-layer CNN that makes extensive use of 3x3 convolutional filters. Trained with a batch size of 128.

STL10 - Wide ResNet 16-8: STL10 contains 1,300 labeled 96x96x3 images per class, split into 500 training images and 800 test images per class. There are 10 classes. Wide ResNet 16-8 (Zagoruyko and Komodakis, 2016) is a 16-layer ResNet that is 8 times wider. Trained with a batch size of 64.

PTB - LSTM: PTB is an English text corpus containing 929,000 training words, 73,000 validation words, and 82,000 test words. There are 10,000 words in the vocabulary. The model is a stacked LSTM (Hochreiter and Schmidhuber, 1997) with 2 layers, 650 units per layer, and dropout of 0.5. Trained with a batch size of 20.

MNIST - VAE: MNIST contains 60,000 32x32x1 grayscale images, split into a 50,000-image training set and a 10,000-image test set. There are 10 classes, one per digit. The VAE (Kingma and Welling, 2015) has three dense encoding layers and three dense decoding layers, with a latent space of size 2. Trained with a batch size of 100.

CIFAR10 - NCSN: CIFAR10 contains 60,000 32x32x3 images, split into a 50,000-image training set and a 10,000-image test set. There are 10 classes. NCSN (Song and Ermon, 2019) is a recent state-of-the-art generative model which achieves the best reported inception score. We compute inception scores based on a total of 50,000 samples. We follow the exact implementation in Song and Ermon (2019) and defer details to the original paper.
A.2 Methods
A.2.1 Adaptive learning rate
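Adam (Kingma and Ba, 2014), as previously introduced in Section 2, keeps an exponentially decaying average of the squares of past gradients to adapt the learning rate. It also keeps an exponentially decaying average of the gradients themselves.
The Adam algorithm is parameterized by learning rate $\eta$, discount factors $\beta_1$ and $\beta_2$, a small constant $\varepsilon$, and uses the update rule (written here without the bias-correction terms):
$$\mathcal{E}^{g}_{t+1} = \beta_1 \mathcal{E}^{g}_{t} + (1 - \beta_1)\, g_t, \qquad \mathcal{E}^{g \circ g}_{t+1} = \beta_2 \mathcal{E}^{g \circ g}_{t} + (1 - \beta_2)\, (g_t \circ g_t),$$
$$\theta_{t+1,i} = \theta_{t,i} - \eta\, \frac{\mathcal{E}^{g}_{t+1,i}}{\sqrt{\mathcal{E}^{g \circ g}_{t+1,i}} + \varepsilon},$$
where $g_t$ is the stochastic gradient at iteration $t$, $\circ$ denotes the elementwise product, and $i$ indexes the parameters.
AMSGrad (Reddi et al., 2019) resolves an issue in the proof of Adam related to the exponential moving average $\mathcal{E}^{g \circ g}$, where Adam does not converge for a simple optimization problem. Instead of an exponential moving average, AMSGrad keeps a running maximum of $\mathcal{E}^{g \circ g}$.
The AMSGrad algorithm is parameterized by learning rate $\eta$, discount factors $\beta_1$ and $\beta_2$, a small constant $\varepsilon$, and uses the update rule:
$$\hat{\mathcal{E}}^{g \circ g}_{t+1} = \max\left(\hat{\mathcal{E}}^{g \circ g}_{t},\, \mathcal{E}^{g \circ g}_{t+1}\right), \qquad \theta_{t+1,i} = \theta_{t,i} - \eta\, \frac{\mathcal{E}^{g}_{t+1,i}}{\sqrt{\hat{\mathcal{E}}^{g \circ g}_{t+1,i}} + \varepsilon},$$
where $\mathcal{E}^{g}_{t+1}$ and $\mathcal{E}^{g \circ g}_{t+1}$ are defined identically to Adam.
QHAdam (Quasi-Hyperbolic Adam) (Ma and Yarats, 2018) extends QHM (Quasi-Hyperbolic Momentum), introduced further below, to replace both momentum estimators in Adam with quasi-hyperbolic terms. This quasi-hyperbolic formulation is capable of recovering Adam and NAdam (Dozat, 2016), amongst others.
The QHAdam algorithm is parameterized by learning rate $\eta$, discount factors $\beta_1$ and $\beta_2$, immediate discount factors $\nu_1$ and $\nu_2$, a small constant $\varepsilon$, and uses the update rule:
$$\theta_{t+1,i} = \theta_{t,i} - \eta\, \frac{(1 - \nu_1)\, g_{t,i} + \nu_1\, \mathcal{E}^{g}_{t+1,i}}{\sqrt{(1 - \nu_2)\, g_{t,i}^2 + \nu_2\, \mathcal{E}^{g \circ g}_{t+1,i}} + \varepsilon},$$
where $\mathcal{E}^{g}_{t+1}$ and $\mathcal{E}^{g \circ g}_{t+1}$ are defined identically to Adam.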
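The three adaptive learning rate methods above differ only in how the two moving averages enter the step, which a single routine can illustrate. The sketch below is not the implementation used in the experiments: the names `init_state` and `adaptive_step` are hypothetical, the default values are common choices rather than the settings tuned in this paper, bias correction is omitted, and the constant is added outside the square root.

```python
import math


def init_state(n):
    """Zero-initialised averages for n scalar parameters (hypothetical helper)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "vmax": [0.0] * n}


def adaptive_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, nu1=1.0, nu2=1.0, amsgrad=False):
    """One parameter update; defaults are assumptions, not the paper's settings.

    nu1 = nu2 = 1, amsgrad=False -> Adam (without bias correction)
    nu1 = nu2 = 1, amsgrad=True  -> AMSGrad
    other nu1, nu2               -> QHAdam
    """
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Exponential moving averages of the gradient and squared gradient.
        m = beta1 * state["m"][i] + (1 - beta1) * g
        v = beta2 * state["v"][i] + (1 - beta2) * g * g
        state["m"][i], state["v"][i] = m, v
        if amsgrad:
            # AMSGrad: use the running maximum of the second-moment average.
            state["vmax"][i] = max(state["vmax"][i], v)
            v = state["vmax"][i]
        # Quasi-hyperbolic mix of the raw gradient and the averages.
        num = (1 - nu1) * g + nu1 * m
        den = math.sqrt((1 - nu2) * g * g + nu2 * v) + eps
        new_theta.append(p - lr * num / den)
    return new_theta


state = init_state(1)
theta = adaptive_step([1.0], [2.0], state)  # a single Adam-style step
```

Setting `nu1 = nu2 = 0` reduces the step to a normalised gradient step, which shows how the immediate discount factors interpolate between the raw gradient and its moving averages.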
A.2.2 Adaptive momentum
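AggMo (Aggregated Momentum) (Lucas et al., 2018) takes a linear combination of multiple momentum buffers. It maintains $K$ momentum buffers, each with a different discount factor, and averages them for the update.
The AggMo algorithm is parameterized by learning rate $\eta$, discount factors $\beta^{(1)}, \dots, \beta^{(K)}$, and uses the update rule:
$$v^{(k)}_{t+1} = \beta^{(k)} v^{(k)}_{t} - g_t \quad (k = 1, \dots, K), \qquad \theta_{t+1} = \theta_{t} + \frac{\eta}{K} \sum_{k=1}^{K} v^{(k)}_{t+1},$$
where $g_t$ is the stochastic gradient at iteration $t$.
QHM (Quasi-Hyperbolic Momentum) (Ma and Yarats, 2018) is a weighted average of the momentum step and the plain SGD step. QHM is capable of recovering Nesterov Momentum (Nesterov, 1983), Synthesized Nesterov Variants (Lessard et al., 2016), accSGD (Jain et al., 2017) and others.
The QHM algorithm is parameterized by learning rate $\eta$, discount factor $\beta$, immediate discount factor $\nu$, and uses the update rule:
$$\mathcal{E}^{g}_{t+1} = \beta\, \mathcal{E}^{g}_{t} + (1 - \beta)\, g_t, \qquad \theta_{t+1} = \theta_{t} - \eta \left[ (1 - \nu)\, g_t + \nu\, \mathcal{E}^{g}_{t+1} \right].$$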
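The two adaptive momentum methods can be sketched in a few lines each. This is an illustration under stated assumptions rather than the code used in the experiments: the function names `aggmo_step` and `qhm_step` are hypothetical, AggMo is written with the sign convention of Lucas et al. (2018), and QHM uses the normalised momentum buffer of Ma and Yarats (2018).

```python
def aggmo_step(theta, grad, buffers, lr, betas):
    """One AggMo step. `buffers` holds one velocity list per discount factor
    and is updated in place; the update averages all K velocities."""
    K = len(betas)
    for k, beta in enumerate(betas):
        buffers[k] = [beta * v - g for v, g in zip(buffers[k], grad)]
    return [p + lr / K * sum(buffers[k][i] for k in range(K))
            for i, p in enumerate(theta)]


def qhm_step(theta, grad, momentum, lr, beta, nu):
    """One QHM step: a nu-weighted average of the momentum step and the
    plain SGD step. Returns the new parameters and the new momentum buffer."""
    new_m = [beta * m + (1 - beta) * g for m, g in zip(momentum, grad)]
    new_theta = [p - lr * ((1 - nu) * g + nu * m)
                 for p, g, m in zip(theta, grad, new_m)]
    return new_theta, new_m


# Hypothetical one-parameter example with two AggMo buffers.
buffers = [[0.0], [0.0]]
theta_aggmo = aggmo_step([1.0], [1.0], buffers, lr=0.1, betas=[0.0, 0.9])

# nu = 0 recovers plain SGD; nu = 1 uses only the momentum buffer.
theta_sgd, _ = qhm_step([1.0], [0.5], [0.0], lr=0.1, beta=0.9, nu=0.0)
```

In the QHM example, `nu = 0` yields the plain SGD step from 1.0 to 0.95, which makes the role of the immediate discount factor easy to check by hand.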
A.3 Optimizer hyperparameters
Optimization method  epochs  learning rate  other parameters

Adam  30  0.001  , 
Adam  75  0.001  
Adam  150  0.001  
Adam  300  0.0003  
AMSGrad  30  0.001  , 
AMSGrad  75  0.001  
AMSGrad  150  0.001  
AMSGrad  300  0.001  
QHAdam  30  0.001  , , , 
QHAdam  75  0.0003  
QHAdam  150  0.0003  
QHAdam  300  0.0003  
DEMON Adam  30  0.0001  , 
DEMON Adam  75  0.0001  
DEMON Adam  150  0.0001  
DEMON Adam  300  0.0001  
AggMo  30  0.03  
AggMo  75  0.01  
AggMo  150  0.01  
AggMo  300  0.01  
QHM  30  1.0  , 
QHM  75  0.3  
QHM  150  0.3  
QHM  300  0.3  
DEMON CM  30  0.1  
DEMON CM  75  0.1  
DEMON CM  150  0.03  
DEMON CM  300  0.03  
CM learning rate decay  30  0.1  , patience = 5 
CM learning rate decay  75  0.1  , patience = 20 
CM learning rate decay  150  0.1  , patience = 20 
CM learning rate decay  300  0.1  , patience = 40 
Optimization method  epochs  learning rate  other parameters

Adam  75  0.0003  , 
Adam  150  0.0003  
Adam  300  0.0003  
AMSGrad  75  0.0003  , 
AMSGrad  150  0.0003  
AMSGrad  300  0.0003  
QHAdam  75  0.0003  , , , 
QHAdam  150  0.0003  
QHAdam  300  0.0003  
DEMON Adam  75  0.00003  , 
DEMON Adam  150  0.00003  
DEMON Adam  300  0.00003  
AggMo  75  0.001  
AggMo  150  0.001  
AggMo  300  0.001  
QHM  75  0.1  , 
QHM  150  0.03  
QHM  300  0.03  
DEMON CM  75  0.1  
DEMON CM  150  0.03  
DEMON CM  300  0.03  
CM learning rate decay  75  0.1  , patience = 5 
CM learning rate decay  150  0.03  , patience = 20 
CM learning rate decay  300  0.03  , patience = 30 
Optimization method  epochs  learning rate  other parameters

Adam  50  0.001  , 
Adam  100  0.0003  
Adam  200  0.0003  
AMSGrad  50  0.0003  , 
AMSGrad  100  0.0003  
AMSGrad  200  0.0003  
QHAdam  50  0.0003  , , , 
QHAdam  100  0.0003  
QHAdam  200  0.0003  
DEMON Adam  50  0.00003  , 
DEMON Adam  100  0.00003  
DEMON Adam  200  0.00003  
AggMo  50  0.03  
AggMo  100  0.03  
AggMo  200  0.01  
QHM  50  0.3  , 
QHM  100  0.3  
QHM  200  0.3  
DEMON CM  50  0.1  
DEMON CM  100  0.1  
DEMON CM  200  0.1  
CM learning rate decay  50  0.1  , patience = 10 
CM learning rate decay  100  0.1  , patience = 10 
CM learning rate decay  200  0.1  , patience = 20 
Optimization method  epochs  learning rate  other parameters

Adam  25  0.0003  , 
Adam  39  0.0003  
AMSGrad  25  0.001  , 
AMSGrad  39  0.001  
QHAdam  25  0.0003  , , , 
QHAdam  39  0.0003  
DEMON Adam  25  0.0001  , 
DEMON Adam  39  0.0001  
AggMo  25  0.03  
AggMo  39  0.03  
QHM  25  1.0  , 
QHM  39  1.0  
DEMON CM  25  1.0  , 
DEMON CM  39  1.0  , 
CM learning rate decay  25  0.1  , smooth learning rate decay 
CM learning rate decay  39  1.0  , smooth learning rate decay 
Optimization method  epochs  learning rate  other parameters

Adam  50  0.001  , 
Adam  100  0.001  
Adam  200  0.001  
AMSGrad  50  0.001  , 
AMSGrad  100  0.001  
AMSGrad  200  0.001  
QHAdam  50  0.001  , , , 
QHAdam  100  0.001  
QHAdam  200  0.001  
DEMON Adam  50  0.0001  , 
DEMON Adam  100  0.0001  
DEMON Adam  200  0.0001  
AggMo  50  0.000003  
AggMo  100  0.000003  
AggMo  200  0.000003  
QHM  50  0.0001  , 
QHM  100  0.00003  
QHM  200  0.00003  
DEMON CM  50  0.00001  
DEMON CM  100  0.00001  
DEMON CM  200  0.000003  
CM learning rate decay  50  0.00001  , patience = 5 
CM learning rate decay  100  0.000003  , patience = 5 
CM learning rate decay  200  0.000003  , patience = 20 