1 Introduction
An increasing share of deep learning researchers are training their models with adaptive gradient methods [3, 12] due to their rapid training time [6]. Adam [8] in particular has become the default algorithm used across many deep learning frameworks. However, the generalization and out-of-sample behavior of such adaptive gradient methods remains poorly understood. Given that many passes over the data are needed to minimize the training objective, typical regret guarantees do not necessarily ensure that the found solutions will generalize [17].
Notably, when the number of parameters exceeds the number of data points, it is possible that the choice of algorithm can dramatically influence which model is learned [15]. Given two different minimizers of some optimization problem, what can we say about their relative ability to generalize? In this paper, we show that adaptive and nonadaptive optimization methods indeed find very different solutions with very different generalization properties. We provide a simple generative model for binary classification where the population is linearly separable (i.e., there exists a solution with large margin), but AdaGrad [3], RMSProp [21], and Adam converge to a solution that incorrectly classifies new data with probability arbitrarily close to half. On this same example, SGD finds a solution with zero error on new data. Our construction shows that adaptive methods tend to give undue influence to spurious features that have no effect on out-of-sample generalization.
We additionally present numerical experiments demonstrating that adaptive methods generalize worse than their nonadaptive counterparts. Our experiments reveal three primary findings. First, with the same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive methods on the development/test set across all evaluated models and tasks. This is true even when the adaptive methods achieve the same training loss or lower than nonadaptive methods. Second, adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. Third, the same amount of tuning was required for all methods, including adaptive methods. This challenges the conventional wisdom that adaptive methods require less tuning. Moreover, as a useful guide to future practice, we propose a simple scheme for tuning learning rates and decays that performs well on all deep learning tasks we studied.
2 Background
The canonical optimization algorithms used to minimize risk are either stochastic gradient methods or stochastic momentum methods. Stochastic gradient methods can generally be written
$$w_{k+1} = w_k - \alpha_k \tilde{\nabla} f(w_k), \tag{2.1}$$
where $\tilde{\nabla} f(w_k) := \nabla f(w_k; x_{i_k})$ is the gradient of some loss function $f$ computed on a batch of data $x_{i_k}$.
Stochastic momentum methods are a second family of techniques that have been used to accelerate training. These methods can generally be written as
$$w_{k+1} = w_k - \alpha_k \tilde{\nabla} f\big(w_k + \gamma_k (w_k - w_{k-1})\big) + \beta_k (w_k - w_{k-1}). \tag{2.2}$$
The sequence of iterates (2.2) includes Polyak’s heavy-ball method (HB) with $\gamma_k = 0$, and Nesterov’s Accelerated Gradient method (NAG) [19] with $\gamma_k = \beta_k$.
Notable exceptions to the general formulations (2.1) and (2.2) are adaptive gradient and adaptive momentum methods, which choose a local distance measure constructed using the entire sequence of iterates $(w_1, \dots, w_k)$. These methods (including AdaGrad [3], RMSProp [21], and Adam [8]) can generally be written as
$$w_{k+1} = w_k - \alpha_k H_k^{-1} \tilde{\nabla} f\big(w_k + \gamma_k (w_k - w_{k-1})\big) + \beta_k H_k^{-1} H_{k-1} (w_k - w_{k-1}), \tag{2.3}$$
where $H_k := H(w_1, \dots, w_k)$ is a positive definite matrix. Though not necessary, the matrix $H_k$ is usually defined as
$$H_k = \mathrm{diag}\Big(\Big\{\sum_{i=1}^{k} \eta_i\, g_i \circ g_i\Big\}^{1/2}\Big), \tag{2.4}$$
where “$\circ$” denotes the entry-wise or Hadamard product, $g_k = \tilde{\nabla} f\big(w_k + \gamma_k (w_k - w_{k-1})\big)$, and $\eta_k$ is some set of coefficients specified for each algorithm. That is, $H_k$ is a diagonal matrix whose entries are the square roots of a linear combination of squares of past gradient components. We will use the fact that $H_k$ are defined in this fashion in the sequel. For the specific settings of the parameters for many of the algorithms used in deep learning, see Table 1. Adaptive methods attempt to adjust an algorithm to the geometry of the data. In contrast, stochastic gradient descent and related variants use the $\ell_2$ geometry inherent to the parameter space, and are equivalent to setting $H_k = I$ in the adaptive methods.
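Parameter  SGD  HB  NAG  AdaGrad  RMSProp  Adam
$G_k$  $I$  $I$  $I$  $G_{k-1} + D_k$  $\beta_2 G_{k-1} + (1-\beta_2) D_k$  $\frac{\beta_2 (1-\beta_2^{k-1})}{1-\beta_2^{k}} G_{k-1} + \frac{1-\beta_2}{1-\beta_2^{k}} D_k$
$\alpha_k$  $\alpha$  $\alpha$  $\alpha$  $\alpha$  $\alpha$  $\alpha \frac{1-\beta_1}{1-\beta_1^{k}}$
$\beta_k$  0  $\beta$  $\beta$  0  0  $\frac{\beta_1 (1-\beta_1^{k-1})}{1-\beta_1^{k}}$
$\gamma_k$  0  0  $\beta$  0  0  0
Table 1: Parameter settings of algorithms used in deep learning. Here, $D_k = \mathrm{diag}(g_k \circ g_k)$ and $G_k := H_k \circ H_k$.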
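As a rough illustration of how one update rule can cover several of these methods, the sketch below iterates $w_{k+1} = w_k - \alpha H_k^{-1} g_k + \beta_k H_k^{-1} H_{k-1}(w_k - w_{k-1})$ with a diagonal $H_k$ stored as a vector. It is a minimal sketch under stated assumptions, not the implementation used in any framework: the function name `run`, the toy quadratic objective, the small constant `1e-12`, and the step sizes are all illustrative choices, and only SGD, HB, and AdaGrad (all with $\gamma_k = 0$) are covered.

```python
import numpy as np

def run(method, grad, w0, alpha, beta=0.9, steps=500):
    """Iterate the generic update with a diagonal H_k (kept as the vector h).

    SGD and HB use H_k = I; AdaGrad uses h_k = sqrt(sum of squared gradients).
    """
    w_prev, w = w0.copy(), w0.copy()
    h_prev = np.ones_like(w0)
    acc = np.zeros_like(w0)              # running sum of g_i * g_i (AdaGrad only)
    for _ in range(steps):
        g = grad(w)                      # gamma_k = 0 for all three methods here
        if method == "adagrad":
            acc += g * g
            h = np.sqrt(acc) + 1e-12     # tiny constant only guards against 0/0
            b = 0.0
        else:
            h = np.ones_like(w0)
            b = beta if method == "hb" else 0.0
        w_next = w - alpha * g / h + b * (h_prev / h) * (w - w_prev)
        w_prev, w, h_prev = w, w_next, h
    return w

# Toy strongly convex quadratic (an assumption made purely for illustration).
curv = np.array([1.0, 10.0])
target = np.array([1.0, -2.0])
grad = lambda w: curv * (w - target)
w0 = np.zeros(2)
results = {m: run(m, grad, w0, alpha=a)
           for m, a in [("sgd", 0.05), ("hb", 0.05), ("adagrad", 0.5)]}
```

On a problem with a unique minimizer all three settings reach the same point; the differences discussed in this paper only appear when there are many minimizers.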
In this context, generalization refers to the performance of a solution on a broader population. Performance is often defined in terms of a different loss function than the function used in training. For example, in classification tasks, we typically define generalization in terms of classification error rather than cross-entropy.
2.1 Related Work
Understanding how optimization relates to generalization is a very active area of current machine learning research. Most of the seminal work in this area has focused on understanding how early stopping can act as implicit regularization [22]. In a similar vein, Ma and Belkin [10] have shown that gradient methods may not be able to find complex solutions at all in any reasonable amount of time. Hardt et al. [17] show that SGD is uniformly stable, and therefore solutions with low training error found quickly will generalize well. Similarly, using a stability argument, Raginsky et al. [16] have shown that Langevin dynamics can find solutions that generalize better than ordinary SGD in non-convex settings. Neyshabur, Srebro, and Tomioka [15] discuss how algorithmic choices can act as an implicit regularizer. Relatedly, Neyshabur, Salakhutdinov, and Srebro [14] show that a different algorithm, one which performs descent using a metric that is invariant to rescaling of the parameters, can lead to solutions which sometimes generalize better than SGD. Our work supports the work of [14] by drawing connections between the metric used to perform local optimization and the ability of the training algorithm to find solutions that generalize. However, we focus primarily on the different generalization properties of adaptive and nonadaptive methods.
A similar line of inquiry has been pursued by Keskar et al. [7]. Hochreiter and Schmidhuber [4] showed that “sharp” minimizers generalize poorly, whereas “flat” minimizers generalize well. Keskar et al. empirically show that Adam converges to sharper minimizers when the batch size is increased. However, they observe that even with small batches, Adam does not find solutions whose performance matches the state of the art. In the current work, we aim to show that the choice of Adam as an optimizer itself strongly influences the set of minimizers that any batch size will ever see, and help explain why they were unable to find solutions that generalized particularly well.
3 The perils of preconditioning
The goal of this section is to illustrate the following observation: when a problem has multiple global minima, different algorithms can find entirely different solutions. In particular, we will show that adaptive gradient methods might find very poor solutions. To simplify the presentation, let us restrict our attention to the simple binary least-squares classification problem, where we can easily compute closed-form formulae for the solutions found by different methods. In least-squares classification, we aim to solve
$$\min_{w} \; R_S[w] := \tfrac{1}{2}\,\|Xw - y\|_2^2. \tag{3.1}$$
Here $X$ is an $n \times d$ matrix of features and $y$ is an $n$-dimensional vector of labels in $\{-1, 1\}$. We aim to find the best linear classifier $w$. Note that when $d > n$, if there is a minimizer with loss $0$ then there is an infinite number of global minimizers. The question remains: what solution does an algorithm find and how well does it generalize to unseen data?
3.1 Nonadaptive methods
Note that most common methods when applied to (3.1) will find the same solution. Indeed, any gradient or stochastic gradient of $R_S$ must lie in the span of the rows of $X$. Therefore, any method that is initialized in the row span of $X$ (say, for instance at $w = 0$) and uses only linear combinations of gradients, stochastic gradients, and previous iterates produces iterates that also lie in the row span of $X$. The unique solution that lies in the row span of $X$ also happens to be the solution with minimum Euclidean norm. We thus denote $w^{\mathrm{SGD}} = X^T (X X^T)^{-1} y$. Almost all nonadaptive methods like SGD, SGD with momentum, minibatch SGD, gradient descent, Nesterov’s method, and the conjugate gradient method will converge to this minimum norm solution. Note that minimum norm solutions have the largest margin out of all solutions of the equation $Xw = y$. Maximizing margin has a long and fruitful history in machine learning, and thus it is a pleasant surprise that gradient descent naturally finds a max-margin solution.
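The row-span argument can be checked numerically. The following is a small sketch, assuming a random Gaussian feature matrix with more features than examples and full-batch gradient descent on $\tfrac{1}{2}\|Xw - y\|_2^2$ started at zero; the sizes, seed, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                              # fewer examples than features
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Closed-form minimum-norm interpolant X^T (X X^T)^{-1} y.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

# Plain gradient descent on 0.5 * ||Xw - y||^2, started in the row span (w = 0).
w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / largest eigenvalue of X^T X
for _ in range(20000):
    w -= step * (X.T @ (X @ w - y))

gap = np.linalg.norm(w - w_min_norm)      # distance to the minimum-norm solution
```

Because every update is a combination of rows of `X`, the iterate never acquires a component in the null space of `X`, which is why `gap` shrinks to numerical zero rather than to some other interpolating solution.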
3.2 Adaptive methods
Let us now consider the case of adaptive methods, restricting our attention to diagonal adaptation. While it is difficult to derive the general form of the solution, we can analyze special cases. Indeed, we can construct a variety of instances where adaptive methods converge to solutions with low $\ell_\infty$ norm rather than low $\ell_2$ norm.
For a vector $z \in \mathbb{R}^q$, let $\mathrm{sign}(z)$ denote the function that maps each component of $z$ to its sign.
Lemma 3.1
Suppose $X^T y$ has no components equal to $0$ and there exists a scalar $c$ such that $X\,\mathrm{sign}(X^T y) = c\,y$. Then, when initialized at $w_0 = 0$, AdaGrad, Adam, and RMSProp all converge to the unique solution $w \propto \mathrm{sign}(X^T y)$.
In other words, whenever there exists a solution of $Xw = y$ that is proportional to $\mathrm{sign}(X^T y)$, this is precisely the solution to which all of the adaptive gradient methods converge.
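Proof We prove this lemma by showing that the entire trajectory of the algorithm consists of iterates whose components have constant magnitude. In particular, we will show that
$$w_k = \lambda_k\, \mathrm{sign}(X^T y)$$
for some scalar $\lambda_k$. Note that $w_0 = 0$ satisfies the assertion with $\lambda_0 = 0$.
Now, assume the assertion holds for all $k \le t$. Observe that
$$\begin{aligned}
\nabla R_S\big(w_k + \gamma_k (w_k - w_{k-1})\big) &= X^T\big(X(w_k + \gamma_k (w_k - w_{k-1})) - y\big) \\
&= X^T\big\{(\lambda_k + \gamma_k(\lambda_k - \lambda_{k-1}))\, X\,\mathrm{sign}(X^T y) - y\big\} \\
&= \big\{(\lambda_k + \gamma_k(\lambda_k - \lambda_{k-1}))\, c - 1\big\}\, X^T y \\
&= \mu_k\, X^T y,
\end{aligned}$$
where the last equation defines $\mu_k$. Hence, letting $g_k = \nabla R_S\big(w_k + \gamma_k (w_k - w_{k-1})\big)$, we also have
$$H_k = \mathrm{diag}\Big(\Big\{\sum_{s=1}^{k} \eta_s\, g_s \circ g_s\Big\}^{1/2}\Big) = \mathrm{diag}\Big(\Big\{\sum_{s=1}^{k} \eta_s \mu_s^2\Big\}^{1/2} |X^T y|\Big) = \nu_k\, \mathrm{diag}\big(|X^T y|\big),$$
where $|u|$ denotes the component-wise absolute value of a vector and the last equation defines $\nu_k$.
Thus we have,
\begin{align}
w_{k+1} &= w_k - \alpha_k H_k^{-1} \nabla R_S\big(w_k + \gamma_k (w_k - w_{k-1})\big) + \beta_k H_k^{-1} H_{k-1} (w_k - w_{k-1}) \tag{3.2} \\
&= \Big\{\lambda_k - \frac{\alpha_k \mu_k}{\nu_k} + \frac{\beta_k \nu_{k-1}}{\nu_k} (\lambda_k - \lambda_{k-1})\Big\}\, \mathrm{sign}(X^T y), \tag{3.3}
\end{align}
proving the claim.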
Note that this solution could be obtained without any optimization at all. One simply could subtract the means of the positive and negative classes and take the sign of the resulting vector. This solution is far simpler than the one obtained by gradient methods, and it would be surprising if such a simple solution would perform particularly well. We now turn to showing that such solutions can indeed generalize arbitrarily poorly.
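To make the lemma concrete, here is a minimal sketch on a hand-built two-example, three-feature instance (the matrix below is an assumption chosen for illustration; it satisfies $X\,\mathrm{sign}(X^T y) = 2y$). It runs full-batch AdaGrad with a constant step size from zero, tracks whether every iterate stays proportional to $\mathrm{sign}(X^T y)$, and compares the limit with the minimum norm solution.

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])
y = np.array([1.0, -1.0])

u = X.T @ y                               # [2, 1, -1]: no zero components
assert np.allclose(X @ np.sign(u), 2.0 * y)   # lemma condition with c = 2

w = np.zeros(3)
acc = np.zeros(3)                         # running sum of squared gradients
alpha = 0.1
max_spread = 0.0
for _ in range(2000):
    g = X.T @ (X @ w - y)                 # gradient of 0.5 * ||Xw - y||^2
    acc += g * g
    w -= alpha * g / np.sqrt(acc)         # AdaGrad step with diagonal H_k
    lam = w * np.sign(u)                  # should equal lambda_k in every coordinate
    max_spread = max(max_spread, lam.max() - lam.min())

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
```

Here AdaGrad ends at `[0.5, 0.5, -0.5]`, while the minimum norm solution is `[2/3, 1/3, -1/3]`: the former has the smaller largest-magnitude entry, the latter the smaller Euclidean norm.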
3.3 Adaptivity can overfit
Lemma 3.1 allows us to construct a particularly pernicious generative model where AdaGrad fails to find a solution that generalizes. This example uses infinite dimensions to simplify bookkeeping, but one could take the dimensionality to be $6n$. Note that in deep learning, we often have a number of parameters equal to $25n$ or more [20], so this is not a particularly high dimensional example by contemporary standards. For $i = 1, \dots, n$, sample the label $y_i$ to be $1$ with probability $p$ and $-1$ with probability $1 - p$ for some $p > 1/2$. Let $x_i$ be an infinite dimensional vector with entries
$$x_{ij} = \begin{cases} y_i & j = 1 \\ 1 & j = 2, 3 \\ 1 & j = 4 + 5(i-1), \dots, 4 + 5(i-1) + 2(1 - y_i) \\ 0 & \text{otherwise.} \end{cases}$$
In other words, the first feature of $x_i$ is the class label. The next $2$ features are always equal to $1$. After this, there is a set of features unique to $x_i$ that are equal to $1$. If the class label is $1$, then there is $1$ such unique feature. If the class label is $-1$, then there are $5$ such features. Note that for such a data set, the only discriminative feature is the first one! Indeed, one can perform perfect classification using only the first feature. The other features are all useless. Features $2$ and $3$ are constant, and each of the remaining features only appears for one example in the data set. However, as we will see, algorithms without such a priori knowledge may not be able to learn these distinctions.
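Take $n$ samples and consider the AdaGrad solution for minimizing $\tfrac{1}{2}\|Xw - y\|^2$. First we show that the conditions of Lemma 3.1 hold. Let $b = \sum_{i=1}^{n} y_i$ and assume for the sake of simplicity that $b > 0$. This will happen with arbitrarily high probability for large enough $n$. Define $u = X^T y$ and observe that
$$u_j = \begin{cases} n & j = 1 \\ b & j = 2, 3 \\ y_i & j > 3 \text{ and } x_{ij} = 1 \text{ for some } i \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad \mathrm{sign}(u_j) = \begin{cases} 1 & j = 1, 2, 3 \\ y_i & j > 3 \text{ and } x_{ij} = 1 \text{ for some } i \\ 0 & \text{otherwise.} \end{cases}$$
Thus we have $\langle \mathrm{sign}(u), x_i \rangle = y_i + 2 + y_i (3 - 2 y_i) = 4 y_i$ for all $i$, as desired. Hence, the AdaGrad solution $w^{\mathrm{ada}} \propto \mathrm{sign}(u)$. In particular, $w^{\mathrm{ada}}$ has all of its components either equal to $0$ or to $\pm\tau$ for some positive constant $\tau$. Now since $w^{\mathrm{ada}}$ has the same sign pattern as $u$, the first three components of $w^{\mathrm{ada}}$ are equal to each other. But for a new data point, $x^{\mathrm{test}}$, the only features that are nonzero in both $x^{\mathrm{test}}$ and $w^{\mathrm{ada}}$ are the first three. In particular, we have
$$\langle w^{\mathrm{ada}}, x^{\mathrm{test}} \rangle = \tau\,(y^{\mathrm{test}} + 2) > 0.$$
Therefore, the AdaGrad solution will label all unseen data as being in the positive class!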
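Now let’s turn to the minimum norm solution. Let $\mathcal{P}$ and $\mathcal{N}$ denote the set of positive and negative examples respectively. Let $n_+ = |\mathcal{P}|$ and $n_- = |\mathcal{N}|$. By symmetry, we have that the minimum norm solution will have the form $w^{\mathrm{SGD}} = \alpha_+ \sum_{i \in \mathcal{P}} x_i - \alpha_- \sum_{j \in \mathcal{N}} x_j$ for some nonnegative scalars $\alpha_+$ and $\alpha_-$. These scalars can be found by solving $X X^T \alpha = y$. In closed form we have
$$\alpha_+ = \frac{4 n_- + 5}{15 n_+ + 3 n_- + 8 n_+ n_- + 5} \qquad \text{and} \qquad \alpha_- = \frac{4 n_+ + 1}{15 n_+ + 3 n_- + 8 n_+ n_- + 5}. \tag{3.4}$$
The algebra required to compute these coefficients can be found in the Appendix. For a new data point, $x^{\mathrm{test}}$, again the only features that are nonzero in both $x^{\mathrm{test}}$ and $w^{\mathrm{SGD}}$ are the first three. Thus we have
$$\langle w^{\mathrm{SGD}}, x^{\mathrm{test}} \rangle = y^{\mathrm{test}} (n_+ \alpha_+ + n_- \alpha_-) + 2 (n_+ \alpha_+ - n_- \alpha_-).$$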
Using (3.4), we see that whenever $n_+ > n_-/3$, the SGD solution makes no errors.
Though this generative model was chosen to illustrate extreme behavior, it shares salient features of many common machine learning instances. There are a few frequent features, where some predictor based on them is a good predictor, though these might not be easy to identify from first inspection. Additionally, there are many other features which are very sparse. On finite training data it looks like such features are good for prediction, since each such feature is very discriminatory for a particular training example, but this is overfitting and an artifact of having fewer training examples than features. Moreover, we will see shortly that adaptive methods typically generalize worse than their nonadaptive counterparts on real datasets as well.
4 Deep Learning Experiments
Having established that adaptive and nonadaptive methods can find quite different solutions in the convex setting, we now turn to an empirical study of deep neural networks to see whether we observe a similar discrepancy in generalization. We compare two nonadaptive methods – SGD and the heavy ball method (HB) – to three popular adaptive methods – AdaGrad, RMSProp and Adam. We study performance on four deep learning problems: (C1) the CIFAR-10 image classification task, (L1) character-level language modeling on the novel War and Peace, and (L2) discriminative parsing and (L3) generative parsing on Penn Treebank. In the interest of reproducibility, we use a network architecture for each problem that is either easily found online (C1, L1, L2, and L3) or produces state-of-the-art results (L2 and L3). Table 2 summarizes the setup for each application. We take care to make minimal changes to the architectures and their data preprocessing pipelines in order to best isolate the effect of each optimization algorithm.
We conduct each experiment 5 times from randomly initialized starting points, using the initialization scheme specified in each code repository. We allocate a prespecified budget on the number of epochs used for training each model. When a development set was available, we chose the settings that achieved the best peak performance on the development set by the end of the fixed epoch budget. CIFAR-10 did not have an explicit development set, so we chose the settings that achieved the lowest training loss at the end of the fixed epoch budget.
Our experiments show the following primary findings: (i) Adaptive methods find solutions that generalize worse than those found by nonadaptive methods. (ii) Even when the adaptive methods achieve the same training loss or lower than nonadaptive methods, the development or test performance is worse. (iii) Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development set. (iv) Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.
Name  Network type  Architecture  Dataset  Framework
C1  Deep Convolutional  cifar.torch  CIFAR-10  Torch
L1  2-Layer LSTM  torch-rnn  War & Peace  Torch
L2  2-Layer LSTM + Feedforward  span-parser  Penn Treebank  DyNet
L3  3-Layer LSTM  emnlp2016  Penn Treebank  Tensorflow
4.1 Hyperparameter Tuning
Optimization hyperparameters have a large influence on the quality of solutions found by optimization algorithms for deep neural networks. The algorithms under consideration have many hyperparameters: the initial step size $\alpha_0$, the step decay scheme, the momentum value $\beta_0$, the momentum schedule $\beta_k$, the smoothing term $\epsilon$, the initialization scheme for the gradient accumulator, and the parameter controlling how to combine gradient outer products, to name a few. A grid search on a large space of hyperparameters is infeasible even with substantial industrial resources, and we found that the parameters that impacted performance the most were the initial step size and the step decay scheme. We left the remaining parameters with their default settings. We describe the differences between the default settings of Torch, DyNet, and Tensorflow in Appendix B for completeness.
To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes. If the best performance was ever at one of the extremes of the grid, we would try new grid points so that the best performance was contained in the middle of the parameters. For example, if we initially tried step sizes 2, 1, 0.5, 0.25, and 0.125 and found that 2 was the best performing, we would have tried the step size 4 to see if performance was improved. If performance improved, we would have tried 8 and so on. We list the initial step sizes we tried in Appendix C.
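The grid-extension rule can be sketched in a few lines. This is only an illustration of the procedure described above, not the scripts used for the experiments: `tune_step_size`, the `evaluate` callback (assumed to return a score where lower is better), the spacing `factor`, and the synthetic score in the usage lines are all hypothetical.

```python
import math

def tune_step_size(evaluate, grid, factor=2.0, max_extensions=10):
    """Return the best step size, extending the grid while the best sits on an edge.

    `grid` is assumed to be logarithmically spaced with ratio `factor`.
    """
    scores = {s: evaluate(s) for s in grid}
    for _ in range(max_extensions):
        best = min(scores, key=scores.get)
        if best == max(scores):            # best is the largest step tried so far
            new = best * factor
        elif best == min(scores):          # best is the smallest step tried so far
            new = best / factor
        else:                              # best is interior: stop searching
            return best
        scores[new] = evaluate(new)
    return min(scores, key=scores.get)

# Usage with a synthetic score whose optimum is at a step size of 8.
score = lambda s: (math.log2(s) - 3.0) ** 2
best_step = tune_step_size(score, [2.0, 1.0, 0.5, 0.25, 0.125])
```

Starting from the grid in the usage lines, the search evaluates 4, 8, and 16 in turn and stops once 8 is bracketed on both sides.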
For step size decay, we explored two separate schemes, a development-based decay scheme (dev-decay) and a fixed frequency decay scheme (fixed-decay). For dev-decay, we keep track of the best validation performance so far, and at each epoch decay the learning rate by a constant factor $\delta$ if the model does not attain a new best value. For fixed-decay, we decay the learning rate by a constant factor every $k$ epochs. We recommend the dev-decay scheme when a development set is available; not only does it have fewer hyperparameters than the fixed frequency scheme, but our experiments also show that it produces results comparable to, or better than, the fixed-decay scheme.
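Both schedules are simple to state in code. The sketch below is a minimal illustration, assuming a development score where higher is better and a multiplicative decay factor; the function names and the parameters `delta` and `k` are illustrative labels rather than identifiers from any of the code bases used here.

```python
def dev_decay(lr, dev_scores, delta):
    """Learning rate used in each epoch under development-based decay.

    After an epoch whose dev score is not a new best, multiply the rate by delta.
    """
    best = float("-inf")
    rates = []
    for score in dev_scores:
        rates.append(lr)                  # rate in effect during this epoch
        if score > best:
            best = score
        else:
            lr *= delta
    return rates

def fixed_decay(lr, num_epochs, delta, k):
    """Learning rate used in each epoch when decaying by delta every k epochs."""
    return [lr * delta ** (epoch // k) for epoch in range(num_epochs)]

dev_rates = dev_decay(1.0, [0.5, 0.6, 0.55, 0.7, 0.7], delta=0.5)
fixed_rates = fixed_decay(1.0, num_epochs=6, delta=0.5, k=2)
```

The development-based variant needs only the decay factor, whereas the fixed variant also needs the decay period, which is the difference in hyperparameter count noted above.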
4.2 Convolutional Neural Network
We used the VGG+BN+Dropout network for CIFAR-10 from the Torch blog [23], which in prior work achieves a baseline test error of 7.55%. Figure 1 shows the learning curve for each algorithm on both the training and test dataset.
We observe that the solutions found by SGD and HB do indeed generalize better than those found by adaptive methods. The best overall test error found by a nonadaptive algorithm, SGD, was , whereas the best adaptive method, RMSProp, achieved a test error of .
Early on in training, the adaptive methods appear to be performing better than the nonadaptive methods, but starting at epoch 50, even though the training error of the adaptive methods is still lower, SGD and HB begin to outperform adaptive methods on the test error. By epoch 100, the performance of SGD and HB surpasses that of all adaptive methods on both train and test. Among all adaptive methods, AdaGrad’s rate of improvement flatlines the earliest. We also found that by increasing the step size, we could drive the performance of the adaptive methods down in the first 50 or so epochs, but the aggressive step size made the flatlining behavior worse, and no step decay scheme could fix the behavior.
4.3 CharacterLevel Language Modeling
Using the torch-rnn library, we train a character-level language model on the text of the novel War and Peace, running for a fixed budget of 200 epochs. Our results are shown in Figure 2.
Under the fixed-decay scheme, the best configuration for all algorithms except AdaGrad was to decay relatively late relative to the total number of epochs, either 60 or 80% through the total number of epochs, and by a large amount, dividing the step size by 10. The dev-decay scheme paralleled (within the same standard deviation) the results of the exhaustive search over the decay frequency and amount; we report the curves from the fixed policy.
Overall, SGD achieved the lowest test loss. AdaGrad has fast initial progress, but flatlines. The adaptive methods appear more sensitive to the initialization scheme than nonadaptive methods, displaying a higher variance on both train and test. Surprisingly, RMSProp closely trails SGD on test loss, confirming that it is not impossible for adaptive methods to find solutions that generalize well. We note that there are step configurations for RMSProp that drive the training loss below that of SGD, but these configurations cause erratic behavior on test, driving the test error of RMSProp above that of Adam.
4.4 Constituency Parsing
A constituency parser is used to predict the hierarchical structure of a sentence, breaking it down into nested clause-level, phrase-level, and word-level units. We carry out experiments using two state-of-the-art parsers: the stand-alone discriminative parser of Cross and Huang [2], and the generative reranking parser of Choe and Charniak [1]. In both cases, we use the dev-decay scheme for learning rate decay.
Discriminative Model.
Cross and Huang [2] develop a transition-based framework that reduces constituency parsing to a sequence prediction problem, giving a one-to-one correspondence between parse trees and sequences of structural and labeling actions. Using their code with the default settings, we trained for 50 epochs on the Penn Treebank [11], comparing labeled F1 scores on the training and development data over time. RMSProp was not implemented in the version of DyNet we used, so we omit it from our experiments. Results are shown in Figure 2.
We find that SGD obtained the best overall performance on the development set, followed closely by HB and Adam, with AdaGrad trailing far behind. The default configuration of Adam without learning rate decay actually achieved the best overall training performance by the end of the run, but was notably worse than tuned Adam on the development set.
Interestingly, Adam achieved its best development F1 of 91.11 quite early, after just 6 epochs, whereas SGD took 18 epochs to reach this value and didn’t reach its best F1 of 91.24 until epoch 31. On the other hand, Adam continued to improve on the training set well after its best development performance was obtained, while the peaks for SGD were more closely aligned.
Generative Model.
Choe and Charniak [1] show that constituency parsing can be cast as a language modeling problem, with trees being represented by their depth-first traversals. This formulation requires a separate base system to produce candidate parse trees, which are then rescored by the generative model. Using an adapted version of their code base, we retrained their model for 100 epochs on the Penn Treebank. (While the code of Choe and Charniak treats the entire corpus as a single long example, relying on the network to reset itself upon encountering an end-of-sentence token, we use the more conventional approach of resetting the network for each example. This reduces training efficiency slightly when batches contain examples of different lengths, but removes a potential confounding factor from our experiments.) However, to reduce computational costs, we made two minor changes: (a) we used a smaller LSTM hidden dimension of 500 instead of 1500, finding that performance decreased only slightly; and (b) we accordingly lowered the dropout ratio from 0.7 to 0.5. Since they demonstrated a high correlation between perplexity (the exponential of the average loss) and labeled F1 on the development set, we explored the relation between training and development perplexity to avoid any conflation with the performance of a base parser.
5 Conclusion
Despite the fact that our experimental evidence demonstrates that adaptive methods are not advantageous for machine learning, the Adam algorithm remains incredibly popular. We are not sure exactly why, but hope that our step-size tuning suggestions make it easier for practitioners to use standard stochastic gradient methods in their research. In our conversations with other researchers, we have surmised that adaptive gradient methods are particularly popular for training GANs [18, 5] and Q-learning with function approximation [13, 9]. Both of these applications stand out because they are not solving optimization problems. It is possible that the dynamics of Adam are accidentally well matched to these sorts of optimization-free iterative search procedures. It is also possible that carefully tuned stochastic gradient methods may work as well or better in both of these applications. It is an exciting direction of future work to determine which of these possibilities is true and to better understand why.
Acknowledgements
The authors would like to thank Pieter Abbeel, Moritz Hardt, Tomer Koren, Sergey Levine, Henry Milner, Yoram Singer, and Shivaram Venkataraman for many helpful comments and suggestions. RR is generously supported by DOE award AC0205CH11231. MS and AW are supported by NSF Graduate Research Fellowships. NS is partially supported by NSFIIS1302662 and NSFIIS1546500, an Inter ICRIRI award and a Google Faculty Award. BR is generously supported by NSF award CCF1359814, ONR awards N000141410024 and N000141712191, the DARPA Fundamental Limits of Learning (Fun LoL) Program, a Sloan Research Fellowship, and a Google Faculty Award.
References

[1] D. K. Choe and E. Charniak. Parsing as language modeling. In J. Su, X. Carreras, and K. Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2331–2336. The Association for Computational Linguistics, 2016.
[2] J. Cross and L. Huang. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In J. Su, X. Carreras, and K. Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pages 1–11. The Association for Computational Linguistics, 2016.
[3] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[4] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
[5] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv:1611.07004, 2016.
[6] A. Karpathy. A peek at trends in machine learning. https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106. Accessed: 2017-05-17.
[7] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In The International Conference on Learning Representations (ICLR), 2017.
[8] D. Kingma and J. Ba. Adam: A method for stochastic optimization. The International Conference on Learning Representations (ICLR), 2015.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
[10] S. Ma and M. Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. arXiv:1703.10622, 2017.
[11] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[12] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
[13] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
[14] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems (NIPS), 2015.
[15] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations (ICLR), 2015.
[16] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv:1702.03849, 2017.
[17] B. Recht, M. Hardt, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[18] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of The International Conference on Machine Learning (ICML), 2016.
[19] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), 2013.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] T. Tieleman and G. Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[22] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
[23] S. Zagoruyko. Torch blog. http://torch.ch/blog/2015/07/30/cifar.html, 2015.
Appendix A Full details of the minimum norm solution from Section 3.3
Full Details.
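The simplest derivation of the minimum norm solution uses the kernel trick. We know that the optimal solution has the form $w^{\mathrm{SGD}} = X^T \alpha$ where $K = X X^T$ and $\alpha = K^{-1} y$. Note that
$$K_{ij} = \begin{cases} 4 & \text{if } i = j \text{ and } y_i = 1 \\ 8 & \text{if } i = j \text{ and } y_i = -1 \\ 3 & \text{if } i \neq j \text{ and } y_i y_j = 1 \\ 1 & \text{if } i \neq j \text{ and } y_i y_j = -1. \end{cases}$$
Positing that $\alpha_i = \alpha_+$ if $y_i = 1$ and $\alpha_i = -\alpha_-$ if $y_i = -1$ leaves us with the equation
$$\begin{aligned} (3 n_+ + 1)\,\alpha_+ - n_-\,\alpha_- &= 1, \\ -n_+\,\alpha_+ + (3 n_- + 5)\,\alpha_- &= 1. \end{aligned}$$
Solving this system of equations yields (3.4).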
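For readers who want the elimination spelled out, the following worked step applies Cramer's rule. It is a sketch under two assumptions that should be checked against Section 3.3: each positive example carries one example-specific feature and each negative example carries five, and the coefficient on negative examples is written as $-\alpha_-$ with $\alpha_- \ge 0$.

```latex
\[
\begin{pmatrix} 3n_+ + 1 & -n_- \\ -n_+ & 3n_- + 5 \end{pmatrix}
\begin{pmatrix} \alpha_+ \\ \alpha_- \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\Delta = (3n_+ + 1)(3n_- + 5) - n_+ n_- = 8 n_+ n_- + 15 n_+ + 3 n_- + 5,
\]
\[
\alpha_+ = \frac{(3n_- + 5) + n_-}{\Delta} = \frac{4n_- + 5}{\Delta},
\qquad
\alpha_- = \frac{(3n_+ + 1) + n_+}{\Delta} = \frac{4n_+ + 1}{\Delta}.
\]
```

Both coefficients are positive for any class counts, which matches the requirement that $\alpha_+$ and $\alpha_-$ be nonnegative.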
Appendix B Differences between Torch, DyNet, and Tensorflow
Parameter  Torch  Tensorflow  DyNet
SGD Momentum  0  No default  0.9
AdaGrad Initial Mean  0  0.1  0
AdaGrad $\epsilon$  1e-10  Not used  1e-20
RMSProp Initial Mean  0  1.0  –
RMSProp $\beta$  0.99  0.9  –
RMSProp $\epsilon$  1e-8  1e-10  –
Adam $\beta_1$  0.9  0.9  0.9
Adam $\beta_2$  0.999  0.999  0.999
Table 3 lists the default values of the parameters for the various deep learning packages used in our experiments. In Torch, the Heavy Ball algorithm is callable simply by changing default momentum away from 0 with nesterov=False. In Tensorflow and DyNet, SGD with momentum is implemented separately from ordinary SGD. For our Heavy Ball experiments we use a constant momentum of .
Appendix C Step sizes used for parameter tuning
CIFAR-10

SGD: {2, 1, 0.5 (best), 0.25, 0.05, 0.01}

HB: {2, 1, 0.5 (best), 0.25, 0.05, 0.01}

AdaGrad: {0.1, 0.05, 0.01 (best, def.), 0.0075, 0.005}

RMSProp: {0.005, 0.001, 0.0005, 0.0003 (best), 0.0001}

Adam: {0.005, 0.001 (default), 0.0005, 0.0003 (best), 0.0001, 0.00005}
The default Torch step sizes for SGD (0.001), HB (0.001), and RMSProp (0.01) were outside the range we tested.
War & Peace

SGD: {2, 1 (best), 0.5, 0.25, 0.125}

HB: {2, 1 (best), 0.5, 0.25, 0.125}

AdaGrad: {0.4, 0.2, 0.1, 0.05 (best), 0.025}

RMSProp: {0.02, 0.01, 0.005, 0.0025, 0.00125, 0.000625, 0.0005 (best), 0.0001}

Adam: {0.005, 0.0025, 0.00125, 0.000625 (best), 0.0003125, 0.00015625}
Under the fixeddecay scheme, we selected learning rate decay frequencies from the set
and learning rate decay amounts from the set .
Discriminative Parsing

SGD: {1.0, 0.5, 0.2, 0.1 (best), 0.05, 0.02, 0.01}

HB: {1.0, 0.5, 0.2, 0.1, 0.05 (best), 0.02, 0.01, 0.005, 0.002}

AdaGrad: {1.0, 0.5, 0.2, 0.1, 0.05, 0.02 (best), 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0002, 0.0001}

RMSProp: Not implemented in DyNet.

Adam: {0.01, 0.005, 0.002 (best), 0.001 (default), 0.0005, 0.0002, 0.0001}
Generative Parsing

SGD: {1.0, 0.5 (best), 0.25, 0.1, 0.05, 0.025, 0.01}

HB: {0.25, 0.1, 0.05, 0.02, 0.01 (best), 0.005, 0.002, 0.001}

AdaGrad: {5.0, 2.5, 1.0, 0.5, 0.25 (best), 0.1, 0.05, 0.02, 0.01}

RMSProp: {0.05, 0.02, 0.01, 0.005, 0.002 (best), 0.001, 0.0005, 0.0002, 0.0001}

Adam: {0.005, 0.001, 0.001 (default), 0.0005 (best), 0.0002, 0.0001}