1 Introduction
Many prominent machine learning models pose empirical risk minimization problems with objectives of the form
(1)  $L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \ell(\theta; x_m),$
(2)  $\nabla L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \nabla \ell(\theta; x_m),$
where $\theta \in \mathbb{R}^d$ is a vector of parameters, $\{x_1, \dots, x_M\}$ is a training set, and $\ell(\theta; x)$ is a loss quantifying the performance of the parameters $\theta$ on example $x$. Computing the exact gradient in each step of an iterative optimization algorithm becomes inefficient for large $M$. Instead, we sample a minibatch $\mathcal{B} \subset \{x_1, \dots, x_M\}$ of size $|\mathcal{B}| \ll M$ with data points drawn uniformly and independently from the training set and compute an approximate stochastic gradient

(3)  $g(\theta) = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} \nabla \ell(\theta; x),$

which is a random variable with $\mathbb{E}[g(\theta)] = \nabla L(\theta)$. An important quantity for this paper will be the (elementwise) variances of the stochastic gradient, which we denote by $\sigma(\theta)_i^2 := \operatorname{var}[g(\theta)_i]$.

Widely used stochastic optimization algorithms are stochastic gradient descent (sgd, Robbins & Monro, 1951) and its momentum variants (Polyak, 1964; Nesterov, 1983). A number of methods popular in deep learning choose per-element update magnitudes based on past gradient observations. Among these are adagrad (Duchi et al., 2011), rmsprop (Tieleman & Hinton, 2012), adadelta (Zeiler, 2012), and adam (Kingma & Ba, 2015).

Notation: In the following, we occasionally drop the argument $\theta$, writing $\sigma_i$ instead of $\sigma(\theta)_i$, et cetera. We use subscripts for iterations and double indices where needed, e.g., $g_t = g(\theta_t)$ and $g_{t,i}$. Divisions, squares and square roots on vectors are to be understood elementwise. To avoid confusion with inner products, we explicitly denote elementwise multiplication of vectors by $\odot$.
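The minibatch estimator defined above can be illustrated with a minimal numerical sketch; the toy loss, constants, and function names below are ours for illustration, not the paper's code:

```python
import random

random.seed(0)

# Toy loss l(theta; x) = 0.5 * (theta - x)^2, so grad_theta l = theta - x.
data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # training set, M = 1000
theta = 0.7

def full_gradient(theta):
    # exact gradient: (1/M) * sum_m (theta - x_m)
    return sum(theta - x for x in data) / len(data)

def minibatch_gradient(theta, batch_size=32):
    # Eq. (3): average gradient over a minibatch drawn uniformly and
    # independently (i.e., with replacement) from the training set
    batch = [random.choice(data) for _ in range(batch_size)]
    return sum(theta - x for x in batch) / batch_size

# Averaging many stochastic gradients recovers the full gradient,
# illustrating E[g(theta)] = grad L(theta); their spread is the variance.
samples = [minibatch_gradient(theta) for _ in range(20000)]
mean_g = sum(samples) / len(samples)
var_g = sum((g - mean_g) ** 2 for g in samples) / len(samples)

assert abs(mean_g - full_gradient(theta)) < 0.01
assert var_g > 0.0
```

The same construction extends coordinate-wise to vector-valued parameters, where $\sigma_i^2$ is estimated per coordinate.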
1.1 A New Perspective on Adam
We start out from a reinterpretation of the widely used adam optimizer (some of our considerations naturally extend to adam's relatives rmsprop and adadelta, but we restrict our attention to adam to keep the presentation concise), which maintains moving averages of stochastic gradients and their elementwise square,
(4)  $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,$
(5)  $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$

with $\beta_1, \beta_2 \in [0, 1)$, bias-corrected versions $\hat m_t = m_t / (1 - \beta_1^t)$ and $\hat v_t = v_t / (1 - \beta_2^t)$, and updates

(6)  $\theta_{t+1} = \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon},$

with a small constant $\varepsilon > 0$ preventing division by zero. Ignoring $\varepsilon$ and assuming $\hat m_{t,i} \neq 0$ for the moment, we can rewrite the update direction as

(7)  $\frac{\hat m_t}{\sqrt{\hat v_t}} = \frac{\operatorname{sign}(\hat m_t)}{\sqrt{1 + (\hat v_t - \hat m_t^2)/\hat m_t^2}},$

where the sign is to be understood elementwise. Assuming that $\hat m_t$ and $\hat v_t$ approximate the first and second moment of the stochastic gradient (a notion that we will discuss further in §4.1), $\hat v_t - \hat m_t^2$ can be seen as an estimate of the stochastic gradient variances. The use of the non-central second moment effectively cancels out the magnitude of $\hat m_t$; it only appears in the ratio $(\hat v_t - \hat m_t^2)/\hat m_t^2$. Hence, adam can be interpreted as a combination of two aspects:

The update direction for the $i$-th coordinate is given by the sign of $\hat m_{t,i}$.

The update magnitude for the $i$-th coordinate is solely determined by the global step size $\alpha$ and the factor

(8)  $\big(1 + \hat\eta_{t,i}^2\big)^{-1/2},$

where $\hat\eta_{t,i}^2 := (\hat v_{t,i} - \hat m_{t,i}^2)/\hat m_{t,i}^2$ is an estimate of the relative variance,

(9)  $\eta_{t,i}^2 := \frac{\sigma_{t,i}^2}{\nabla L_{t,i}^2}.$
We will refer to the second aspect as variance adaptation. The variance adaptation factors shorten the update in directions of high relative variance, adapting for varying reliability of the stochastic gradient in different coordinates.
The above interpretation of adam's update rule has to be viewed in contrast to existing ones. A motivation given by Kingma & Ba (2015) is that $\hat v_t$ is a diagonal approximation to the empirical Fisher information matrix (FIM), making adam an approximation to natural gradient descent (Amari, 1998). Apart from fundamental reservations towards the empirical Fisher and the quality of diagonal approximations (Martens, 2014, §11), this view is problematic because the FIM, if anything, is approximated by $\hat v_t$, whereas adam adapts with the square root $\sqrt{\hat v_t}$.
Another possible motivation (which is not found in peer-reviewed publications but circulates in the community as "conventional wisdom") is that adam performs an approximate whitening of stochastic gradients. However, this view overlooks the fact that adam divides by the square root of the non-central second moment, not by the standard deviation.
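The rewrite in Eq. (7) is an exact algebraic identity, which a short numeric sketch can confirm; the scalar values below are arbitrary test inputs of our choosing:

```python
import math

def adam_direction(m, v):
    # Adam's (elementwise, eps-free) update direction m_hat / sqrt(v_hat)
    return m / math.sqrt(v)

def sign_times_factor(m, v):
    # Eq. (7): sign(m_hat) scaled by the variance adaptation factor
    eta2 = (v - m * m) / (m * m)          # estimated relative variance
    sign = 1.0 if m > 0 else -1.0
    return sign / math.sqrt(1.0 + eta2)

# the two expressions agree for any m != 0 and v > 0
for m, v in [(0.5, 1.3), (-0.2, 0.9), (2.0, 4.5)]:
    assert math.isclose(adam_direction(m, v), sign_times_factor(m, v))
```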
1.2 Overview
Both aspects of adam, taking the sign and variance adaptation, are briefly mentioned in Kingma & Ba (2015), who note that "[t]he effective stepsize [...] is also invariant to the scale of the gradients" and refer to $\hat m_t / \sqrt{\hat v_t}$ as a "signal-to-noise ratio". The purpose of this work is to disentangle these two aspects in order to discuss and analyze them in isolation.
This perspective naturally suggests two alternative methods by incorporating one of the aspects while excluding the other. Taking the sign of a stochastic gradient without any further modification gives rise to Stochastic Sign Descent (ssd). On the other hand, Stochastic VarianceAdapted Gradient (svag), to be derived in §3.2, applies variance adaptation directly to the stochastic gradient instead of its sign. Together with adam, the momentum variants of sgd, ssd, and svag constitute the four possible recombinations of the sign aspect and the variance adaptation, see Fig. 1.
We proceed as follows: Section 2 discusses the sign aspect. In a simplified setting we investigate under which circumstances the sign of a stochastic gradient is a better update direction than the stochastic gradient itself. Section 3 presents a principled derivation of elementwise variance adaptation factors. Subsequently, we discuss the practical implementation of varianceadapted methods (Section 4). Section 5 draws a connection to recent work on adam’s effect on generalization. Finally, Section 6 presents experimental results.
1.3 Related Work
Sign-based optimization algorithms have received some attention in the past. rprop (Riedmiller & Braun, 1993) is based on gradient signs and adapts per-element update magnitudes based on observed sign changes. Seide et al. (2014) empirically investigate the use of stochastic gradient signs in a distributed setting with the goal of reducing communication cost. Karimi et al. (2016) prove convergence results for sign-based methods in the non-stochastic case.
Variance-based update directions have been proposed before, e.g., by Schaul et al. (2013), where the variance appears together with curvature estimates in a diagonal preconditioner for sgd. Their variance-dependent terms resemble the variance adaptation factors we will derive in Section 3. The corresponding parts of our work complement those of Schaul et al. (2013) in various ways. Most notably, we provide a principled motivation for variance adaptation that is independent of the update direction and use it to extend variance adaptation to the momentum case.
1.4 The Sign of a Stochastic Gradient
For later use, we briefly establish some facts about the sign of a stochastic gradient, $\operatorname{sign}(g)$. (To avoid a separate zero-case, we define $\operatorname{sign}(0) := 1$ for all theoretical considerations; note that $g_i \neq 0$ almost surely if $\sigma_i > 0$.) The distribution of the binary random variable $\operatorname{sign}(g_i)$ is fully characterized by the success probability

$\rho_i := \mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big],$

which generally depends on the distribution of $g_i$. If we assume $g_i$ to be normally distributed, which is supported by the Central Limit Theorem applied to Eq. (3), we have

(10)  $\rho_i = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{|\nabla L_i|}{\sqrt{2}\,\sigma_i}\right)\right),$

see §B.1 of the supplementary material. Note that $\rho_i$ is uniquely determined by the relative variance of $g_i$.
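Eq. (10) is easy to sanity-check by simulation; this is a sketch under the Gaussian assumption, with arbitrary test values for the gradient and noise level:

```python
import math
import random

random.seed(1)

def success_prob_theory(grad, sigma):
    # Eq. (10): rho = (1/2) * (1 + erf(|grad| / (sqrt(2) * sigma)))
    return 0.5 * (1.0 + math.erf(abs(grad) / (math.sqrt(2.0) * sigma)))

def success_prob_mc(grad, sigma, n=100000):
    # empirical fraction of Gaussian stochastic gradients g ~ N(grad, sigma^2)
    # whose sign agrees with the sign of the true gradient
    hits = sum(
        1 for _ in range(n)
        if math.copysign(1.0, random.gauss(grad, sigma)) == math.copysign(1.0, grad)
    )
    return hits / n

# Monte Carlo estimates match the closed form for positive and negative gradients
for grad, sigma in [(0.3, 1.0), (-1.0, 0.5)]:
    assert abs(success_prob_mc(grad, sigma) - success_prob_theory(grad, sigma)) < 0.01
```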
2 Why the Sign?
Can it make sense to use the sign of a stochastic gradient as the update direction instead of the stochastic gradient itself? This question is difficult to tackle in a general setting, but we can get an intuition using the simple, yet insightful, case of stochastic quadratic problems, where we can investigate the effects of curvature properties and noise.
Model Problem (Stochastic Quadratic Problem, sQP).
Consider the loss function $\ell(\theta, x) = \frac{1}{2}(\theta - x)^\mathsf{T} Q (\theta - x)$ with a symmetric positive definite matrix $Q$ and "data" coming from the distribution $x \sim \mathcal{N}(x^*, \nu^2 I)$; without loss of generality, $x^* = 0$. The objective evaluates to

(11)  $L(\theta) = \frac{1}{2}\, \theta^\mathsf{T} Q\, \theta + \text{const},$

with $\nabla L(\theta) = Q\theta$. Stochastic gradients are given by $g(\theta) = Q(\theta - x)$.
2.1 Theoretical Comparison
We compare update directions on sQPs in terms of their local expected decrease in function value from a single step. For any stochastic direction $p$, updating from $\theta$ to $\theta + \alpha p$ results in $\mathbb{E}[L(\theta + \alpha p)] = L(\theta) + \alpha\, \mathbb{E}[\nabla L(\theta)^\mathsf{T} p] + \frac{\alpha^2}{2}\, \mathbb{E}[p^\mathsf{T} Q p]$. For this comparison of update directions we use the optimal step size minimizing this expression, which is easily found to be $\alpha^* = -\mathbb{E}[\nabla L^\mathsf{T} p] / \mathbb{E}[p^\mathsf{T} Q p]$ and yields an expected improvement of

(12)  $\mathcal{I}(p) := \frac{\mathbb{E}[\nabla L^\mathsf{T} p]^2}{2\, \mathbb{E}[p^\mathsf{T} Q p]}.$

Locally, a larger expected improvement implies a better update direction. We compute this quantity for sgd ($p = -g$) and ssd ($p = -\operatorname{sign}(g)$) in §B.2 of the supplementary material and find

(13)  $\mathcal{I}(-g) = \frac{\lVert \nabla L \rVert^4}{2\left(\nabla L^\mathsf{T} Q\, \nabla L + \nu^2 \operatorname{tr}(Q^3)\right)},$

(14)  $\mathcal{I}(-\operatorname{sign}(g)) \geq \frac{p_{\mathrm{diag}} \left( \sum_i (2\rho_i - 1)\, |\nabla L_i| \right)^2}{2 \sum_i \lambda_i},$
where the $\lambda_i$ are the eigenvalues of $Q$ and $p_{\mathrm{diag}} := \sum_i q_{ii} / \sum_{i,j} |q_{ij}|$ measures the fraction of diagonal mass of $Q$. $\mathcal{I}(-g)$ and $\mathcal{I}(-\operatorname{sign}(g))$ are local quantities, depending on $\theta$, which makes a general and conclusive comparison impossible. However, we can draw conclusions about how properties of the sQP affect the two update directions. We make the following two observations.

Firstly, the term $p_{\mathrm{diag}}$, which features only in $\mathcal{I}(-\operatorname{sign}(g))$, relates to the orientation of the eigenbasis of $Q$. If $Q$ is diagonal, the problem is perfectly axis-aligned and we have $p_{\mathrm{diag}} = 1$. This is the obvious best case for the intrinsically axis-aligned sign update. However, $p_{\mathrm{diag}}$ can become as small as $1/d$ in the worst case and scales as $d^{-1/2}$ on average (over random orientations). (We show these properties in §B.2 of the supplementary material.) This suggests that the sign update will have difficulties with arbitrarily rotated eigenbases and crucially relies on the problem being "close to axis-aligned".

Secondly, $\mathcal{I}(-g)$ contains the term $\nu^2 \operatorname{tr}(Q^3)$, in which stochastic noise and the eigenspectrum of the problem interact. $\mathcal{I}(-\operatorname{sign}(g))$, on the other hand, has a milder dependence on the eigenvalues of $Q$, and there is no such interaction between noise and eigenspectrum. The noise only manifests in the elementwise success probabilities $\rho_i$.
In summary, we can expect the sign direction to be beneficial for noisy, ill-conditioned problems with diagonally dominant Hessians. It is unclear to what extent these properties hold for real problems, on which sign-based methods like adam are usually applied. Becker & LeCun (1988) empirically investigated the first property for Hessians of simple neural network training problems and found comparably high values of $p_{\mathrm{diag}}$. Chaudhari et al. (2017) empirically investigated the eigenspectrum in deep learning problems and found it to be very ill-conditioned, with the majority of eigenvalues close to zero and a few very large ones. However, this empirical evidence is far from conclusive.

2.2 Experimental Evaluation
We verify our findings experimentally on 100-dimensional sQPs. First, we specify a diagonal matrix $\Lambda$ of eigenvalues: (a) a mildly conditioned problem with eigenvalues drawn uniformly from a moderate range and (b) an ill-conditioned problem with a structured eigenspectrum simulating the one reported by Chaudhari et al. (2017), obtained by uniformly drawing 90% of the eigenvalues from a small range close to zero and the remaining 10% from a range several orders of magnitude larger. $Q$ is then defined as (a) $Q = \Lambda$ for an axis-aligned problem and (b) $Q = R \Lambda R^\mathsf{T}$ with a rotation matrix $R$ drawn uniformly among all rotation matrices (see Diaconis & Shahshahani, 1987). This makes four different matrices, which we consider at varying noise levels $\nu$. We run sgd and ssd with their optimal local step sizes as previously derived. The results, shown in Fig. 2, confirm our theoretical findings.
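The sQP's stochastic gradient model can be sketched in a few lines; the matrix, noise level, and dimensions below are small illustrative choices of ours, not the paper's 100-dimensional setup:

```python
import random

random.seed(2)

Q = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive definite
nu = 0.3                        # noise level
theta = [1.0, -0.5]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def stochastic_gradient(theta):
    # g(theta) = Q (theta - x) with "data" x ~ N(0, nu^2 I)
    x = [random.gauss(0.0, nu) for _ in range(len(theta))]
    return matvec(Q, [t - xi for t, xi in zip(theta, x)])

true_grad = matvec(Q, theta)    # grad L(theta) = Q theta
n = 50000
mean_g = [0.0, 0.0]
for _ in range(n):
    g = stochastic_gradient(theta)
    mean_g = [m + gi / n for m, gi in zip(mean_g, g)]

# unbiasedness: the averaged stochastic gradients approach Q theta
assert all(abs(m - t) < 0.02 for m, t in zip(mean_g, true_grad))
```

The gradient covariance in this model is $\nu^2 Q^2$, which is what couples noise and eigenspectrum in $\mathcal{I}(-g)$.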
3 Variance Adaptation
We now proceed to the second component of adam: variance-based elementwise step sizes. Considering this variance adaptation in isolation from the sign aspect naturally suggests employing it on arbitrary update directions, for example directly on the stochastic gradient instead of its sign. A principled motivation arises from the following consideration:

Assume we want to update in a direction $\nabla L$ (or $\operatorname{sign}(\nabla L)$), but only have access to an estimate $g$ with $\mathbb{E}[g] = \nabla L$. We allow elementwise scaling factors $\gamma \in \mathbb{R}^d$ and update $\gamma \odot g$ (or $\gamma \odot \operatorname{sign}(g)$). One way to make "optimal" use of these factors is to choose them so as to minimize the expected distance to the desired update direction.
Lemma 1.
Let $g \in \mathbb{R}^d$ be a random variable with $\mathbb{E}[g] = \nabla L$ and $\operatorname{var}[g_i] = \sigma_i^2$. Then $\mathbb{E}\big[\lVert \gamma \odot g - \nabla L \rVert^2\big]$ is minimized by

(15)  $\gamma_i = \frac{\nabla L_i^2}{\nabla L_i^2 + \sigma_i^2},$

and $\mathbb{E}\big[\lVert \gamma \odot \operatorname{sign}(g) - \operatorname{sign}(\nabla L) \rVert^2\big]$ is minimized by

(16)  $\gamma_i = 2\rho_i - 1,$

where $\rho_i = \mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big]$. (Proof in §B.3)
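The first part of Lemma 1 can be checked numerically in one dimension; the gradient and noise values below are arbitrary test inputs, not taken from the paper:

```python
import random

random.seed(3)

grad, sigma = 0.8, 0.6
samples = [random.gauss(grad, sigma) for _ in range(100000)]

def expected_sq_dist(gamma):
    # Monte Carlo estimate of E[(gamma * g - grad)^2]
    return sum((gamma * g - grad) ** 2 for g in samples) / len(samples)

gamma_star = grad ** 2 / (grad ** 2 + sigma ** 2)   # Eq. (15)

# perturbing gamma in either direction increases the expected distance
for gamma in (gamma_star - 0.1, gamma_star + 0.1):
    assert expected_sq_dist(gamma_star) < expected_sq_dist(gamma)
```

The objective is quadratic in $\gamma$, so the minimizer can also be read off directly from $\mathbb{E}[(\gamma g - \nabla L)^2] = \gamma^2(\nabla L^2 + \sigma^2) - 2\gamma \nabla L^2 + \nabla L^2$.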
3.1 ADAM as Variance-Adapted Sign Descent
According to Lemma 1, the optimal variance adaptation factors for the sign of a stochastic gradient are $\gamma_i = 2\rho_i - 1$, where $\rho_i = \mathrm{P}[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)]$. Appealing to intuition, this means that $\gamma_i$ increases with the success probability, with a maximum of $1$ when we are certain about the sign of the gradient ($\rho_i = 1$) and a minimum of $0$ in the absence of information ($\rho_i = 1/2$).
Recall from Eq. (10) that, under the Gaussian assumption, the success probabilities are $\rho_i = \frac{1}{2}\big(1 + \operatorname{erf}\big(1/(\sqrt{2}\,\eta_i)\big)\big)$, such that $2\rho_i - 1 = \operatorname{erf}\big(1/(\sqrt{2}\,\eta_i)\big)$. Figure 3 shows that this term is closely approximated by $(1 + \eta_i^2)^{-1/2}$, the variance adaptation term of adam. Hence, adam can be regarded as an approximate realization of this optimal variance adaptation scheme. This comes with the caveat that adam applies these factors to $\operatorname{sign}(\hat m_t)$ instead of $\operatorname{sign}(g_t)$. Variance adaptation for $\hat m_t$ will be discussed further in §4.3 and in the supplementary material (§C.2).
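The closeness of the two factors (cf. Fig. 3) can be probed numerically; this is a rough sketch over a grid of relative noise levels of our choosing, not a formal bound:

```python
import math

def optimal_factor(eta):
    # 2 * rho - 1 = erf(1 / (sqrt(2) * eta)), optimal per Lemma 1 / Eq. (16)
    return math.erf(1.0 / (math.sqrt(2.0) * eta))

def adam_factor(eta):
    # Adam's variance adaptation factor, Eq. (8)
    return 1.0 / math.sqrt(1.0 + eta ** 2)

# over a wide range of relative noise levels, the two factors stay close
max_gap = max(
    abs(optimal_factor(eta) - adam_factor(eta))
    for eta in (0.1 * k for k in range(1, 100))
)
assert max_gap < 0.1
```

Both factors tend to $1$ as $\eta \to 0$ (reliable gradients) and to $0$ as $\eta \to \infty$ (pure noise).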
3.2 Stochastic Variance-Adapted Gradient (SVAG)
Applying Eq. (15) to the stochastic gradient $g$, the optimal variance adaptation factors are found to be

(17)  $\gamma_i = \frac{\nabla L_i^2}{\nabla L_i^2 + \sigma_i^2} = \frac{1}{1 + \eta_i^2}.$

A term of this form also appears, together with diagonal curvature estimates, in Schaul et al. (2013). We refer to the method updating along $-\gamma \odot g$ as Stochastic Variance-Adapted Gradient (svag). To support intuition, Fig. 4 shows a conceptual sketch of this variance adaptation scheme.
Variance adaptation of this form guarantees convergence without manually decreasing the global step size; we recover the $\mathcal{O}(1/t)$ rate of sgd for smooth, strongly convex functions. We emphasize that this result considers an idealized version of svag with exact variance adaptation factors. It should be considered a motivation for this variance adaptation strategy, not a statement about its performance with estimated variance adaptation factors.
Theorem 1.
Let $L$ be $\mu$-strongly convex and $M$-smooth. We update $\theta_{t+1} = \theta_t - \alpha\, \gamma_t \odot g_t$, with stochastic gradients $\mathbb{E}[g_t] = \nabla L(\theta_t)$, $\operatorname{var}[g_{t,i}] = \sigma_{t,i}^2$, variance adaptation factors $\gamma_{t,i} = \nabla L_{t,i}^2 / (\nabla L_{t,i}^2 + \sigma_{t,i}^2)$, and a global step size $\alpha = 1/M$. Assume that there are constants $c_1, c_2 \geq 0$ such that $\sigma_{t,i}^2 \leq c_1 \nabla L_{t,i}^2 + c_2$. Then

(18)  $\mathbb{E}[L(\theta_t)] - L^* \in \mathcal{O}(1/t),$

where $L^*$ is the minimum value of $L$. (Proof in §B.4)
The assumption $\sigma_{t,i}^2 \leq c_1 \nabla L_{t,i}^2 + c_2$ is a mild restriction on the variances, allowing them to be nonzero everywhere and to grow quadratically in the gradient norm.
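The flavor of Theorem 1 can be seen on a one-dimensional quadratic $L(\theta) = \theta^2/2$ with exact variance adaptation factors; the noise level, horizon, and starting point below are illustrative choices, not the constants of the theorem:

```python
import random

random.seed(4)

sigma = 1.0     # stochastic gradient noise: g = grad + N(0, sigma^2)
theta = 5.0
alpha = 1.0     # constant global step size (L is 1-smooth)

for t in range(2000):
    grad = theta                                    # exact gradient of L
    gamma = grad ** 2 / (grad ** 2 + sigma ** 2)    # exact factor, Eq. (17)
    g = grad + random.gauss(0.0, sigma)             # stochastic gradient
    theta = theta - alpha * gamma * g

# the iterate approaches the minimum without any step size decay
assert abs(theta) < 0.5
```

In this setting one can show $\mathbb{E}[\theta_{t+1}^2] = \mathbb{E}[\theta_t^2]\,\sigma^2 / (\mathbb{E}[\theta_t^2] + \sigma^2)$, which decays like $\sigma^2/t$, matching the $\mathcal{O}(1/t)$ rate.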
4 Practical Implementation of MSVAG
Section 3 has introduced the general idea of variance adaptation; we now discuss its practical implementation. For the sake of a concise presentation, we focus on one particular variance-adapted method, msvag, which applies variance adaptation to the update direction $\hat m_t$. This method is of particular interest due to its relationship to adam outlined in Figure 1. Many of the following considerations correspondingly apply to other variance-adapted methods, e.g., svag and variants of adam, some of which are discussed and evaluated in the supplementary material (§C).
4.1 Estimating Gradient Variance
In practice, the optimal variance adaptation factors are unknown and have to be estimated. A key ingredient is an estimate of the stochastic gradient variance. We have argued in the introduction that adam obtains such an estimate from moving averages, $\hat v_t - \hat m_t^2$. The underlying assumption is that the distribution of stochastic gradients is approximately constant over the effective time horizon of the exponential moving average, making $\hat m_t$ and $\hat v_t$ estimates of the first and second moment of $g_t$, respectively:

Assumption 1.

At step $t$, assume

(19)  $\mathbb{E}[g_s] = \nabla L(\theta_t), \qquad \operatorname{var}[g_{s,i}] = \sigma_{t,i}^2, \qquad s = 1, \dots, t.$
While this can only ever hold approximately, Assumption 1 is the tool we need to obtain gradient variance estimates from past gradient observations. It will be more realistic in the case of high noise and small step size, where the variation between successive stochastic gradients is dominated by stochasticity rather than change in the true gradient.
We make two modifications to adam's variance estimate. First, we use the same moving average constant $\beta$ for $m_t$ and $v_t$. This constant should define the effective time horizon for which we implicitly assume the stochastic gradients to come from the same distribution, making different constants for the first and second moment implausible.
Secondly, we adapt for a systematic bias in the variance estimate. As we show in §B.5, under Assumption 1,

(20)  $\mathbb{E}[\hat m_t] = \nabla L(\theta_t), \qquad \operatorname{var}[\hat m_{t,i}] = \rho(\beta, t)\, \sigma_{t,i}^2, \qquad \rho(\beta, t) := \frac{(1 - \beta)(1 + \beta^t)}{(1 + \beta)(1 - \beta^t)},$

(21)  $\mathbb{E}[\hat v_t] = \nabla L(\theta_t)^2 + \sigma_t^2,$

and consequently $\mathbb{E}[\hat v_t - \hat m_t^2] = \big(1 - \rho(\beta, t)\big)\, \sigma_t^2$. We correct for this bias and use the variance estimate

(22)  $\hat s_t := \frac{1}{1 - \rho(\beta, t)} \left( \hat v_t - \hat m_t^2 \right).$
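For i.i.d. gradients, the bias correction of Eq. (22) can be verified by simulation; the mean, noise level, horizon, and function names below are illustrative, not from the released implementation:

```python
import random

random.seed(5)

beta, mu, sigma, T = 0.9, 0.4, 0.7, 30

def rho(beta, t):
    # Eq. (20): variance factor of the bias-corrected moving average m_hat
    return (1 - beta) * (1 + beta ** t) / ((1 + beta) * (1 - beta ** t))

def corrected_estimate():
    # run the moving averages of Eqs. (4)-(5) on T i.i.d. g ~ N(mu, sigma^2)
    m = v = 0.0
    for _ in range(T):
        g = random.gauss(mu, sigma)
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g * g
    m_hat = m / (1 - beta ** T)
    v_hat = v / (1 - beta ** T)
    return (v_hat - m_hat ** 2) / (1 - rho(beta, T))   # Eq. (22)

# averaged over many runs, the corrected estimate matches sigma^2
estimates = [corrected_estimate() for _ in range(20000)]
mean_est = sum(estimates) / len(estimates)
assert abs(mean_est - sigma ** 2) < 0.02
```

Without the $1/(1 - \rho(\beta, t))$ factor, the same experiment underestimates $\sigma^2$, since $\mathbb{E}[\hat v_t - \hat m_t^2] = (1 - \rho(\beta, t))\sigma^2$.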
Minibatch Gradient Variance Estimates: An alternative variance estimate can be computed locally "within" a single minibatch; see §D of the supplementary material. We have experimented with both estimators and found the resulting methods to have similar performance. For the main paper, we stick to the moving average variant for its ease of implementation and direct correspondence with adam. We present experiments with the minibatch variant in the supplementary material. These demonstrate the merit of variance adaptation irrespective of how the variance is estimated.
4.2 Estimating the Variance Adaptation Factors
The gradient variance itself is not of primary interest; we have to estimate the variance adaptation factors, given by Eq. (17) in the case of svag. We propose to use the estimate

(23)  $\hat\gamma_{t,i} = \frac{\hat m_{t,i}^2}{\hat m_{t,i}^2 + \hat s_{t,i}}.$
While $\hat\gamma_t$ is an intuitive quantity, it is not an unbiased estimate of the exact variance adaptation factors as defined in Eq. (17). To our knowledge, unbiased estimation of the exact factors is intractable. We have experimented with several partial bias correction terms but found them to have destabilizing effects.

4.3 Incorporating Momentum
So far, we have considered variance adaptation for the update direction $g_t$. In practice, we may want to update in the direction of $\hat m_t$ to incorporate momentum. (Our use of the term momentum is somewhat colloquial. To highlight the relationship with adam (Fig. 1), we have defined msgd as the method using the update direction $\hat m_t$, which is a rescaled version of sgd with momentum. msvag applies variance adaptation to $\hat m_t$. This is not to be confused with the application of momentum acceleration (Polyak, 1964; Nesterov, 1983) on top of a svag update.) According to Lemma 1, the variance adaptation factors should then be determined by the relative variance of $\hat m_t$.
Once more adopting Assumption 1, we have $\mathbb{E}[\hat m_t] = \nabla L(\theta_t)$ and $\operatorname{var}[\hat m_{t,i}] = \rho(\beta, t)\, \sigma_{t,i}^2$, the latter being due to Eq. (20). Hence, the relative variance of $\hat m_t$ is $\rho(\beta, t)$ times that of $g_t$, such that the optimal variance adaptation factors for the update direction $\hat m_t$ according to Lemma 1 are

(24)  $\gamma_{t,i} = \frac{\nabla L_{t,i}^2}{\nabla L_{t,i}^2 + \rho(\beta, t)\, \sigma_{t,i}^2}.$

We use the following estimate thereof:

(25)  $\hat\gamma_{t,i} = \frac{\hat m_{t,i}^2}{\hat m_{t,i}^2 + \rho(\beta, t)\, \hat s_{t,i}}.$
Note that $\hat m_t$ now serves a double purpose: it determines the base update direction and, at the same time, is used to obtain an estimate of the gradient variance.
4.4 Details
Note that Eq. (22) is ill-defined for $t = 1$, since $\rho(\beta, 1) = 1$. We use $\hat\gamma_1 = 1$ for the first iteration, making the initial step of msvag coincide with an sgd step. One final detail concerns a possible division by zero in Eq. (25). Unlike adam, we do not add a constant offset $\varepsilon$ to the denominator. A division by zero only occurs when $\hat m_{t,i} = 0$ and $\hat s_{t,i} = 0$; we check for this case and perform no update in that coordinate, since $\hat m_{t,i} = 0$.
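The pieces above assemble into the following one-dimensional sketch of msvag; the hyperparameters and test problem are our own choices, and this is not the released implementation:

```python
import random

random.seed(6)

def rho(beta, t):
    # Eq. (20): variance factor of the bias-corrected moving average
    return (1 - beta) * (1 + beta ** t) / ((1 + beta) * (1 - beta ** t))

def msvag(stoch_grad, theta0, alpha=0.1, beta=0.9, steps=1000):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = stoch_grad(theta)
        m = beta * m + (1 - beta) * g            # Eq. (4)
        v = beta * v + (1 - beta) * g * g        # Eq. (5)
        m_hat = m / (1 - beta ** t)
        v_hat = v / (1 - beta ** t)
        if t == 1:
            gamma = 1.0   # rho(beta, 1) = 1, so the first step is an sgd step
        else:
            r = rho(beta, t)
            s = (v_hat - m_hat ** 2) / (1 - r)   # Eq. (22)
            denom = m_hat ** 2 + r * s           # Eq. (25)
            # no constant offset: skip the update if the denominator vanishes
            gamma = m_hat ** 2 / denom if denom != 0 else 0.0
        theta = theta - alpha * gamma * m_hat
    return theta

# noisy 1-d quadratic with minimum at 2: L(theta) = (theta - 2)^2 / 2
theta_final = msvag(lambda th: (th - 2.0) + random.gauss(0.0, 1.0), theta0=8.0)
assert abs(theta_final - 2.0) < 1.0
```

Note that $\hat m_t^2 \leq \hat v_t$ holds by the Cauchy-Schwarz inequality, so $\hat s_t \geq 0$ and $\hat\gamma_t \in [0, 1]$.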
5 Connection to Generalization
Of late, the question of the effect of the optimization algorithm on generalization has received increased attention. Especially in deep learning, different optimizers might find solutions with varying generalization performance. Recently, Wilson et al. (2017) have argued that "adaptive methods" (referring to adagrad, rmsprop, and adam) have adverse effects on generalization compared to "non-adaptive methods" (gradient descent, sgd, and their momentum variants). In addition to an extensive empirical validation of that claim, the authors make a theoretical argument using a binary least-squares classification problem,
(26)  $\min_{w \in \mathbb{R}^d}\; \lVert X w - y \rVert_2^2,$

with data points $x_i \in \mathbb{R}^d$, stacked in a matrix $X \in \mathbb{R}^{n \times d}$, and a label vector $y \in \{\pm 1\}^n$. For this problem class, the non-adaptive methods provably converge to the max-margin solution, which we expect to have favorable generalization properties. In contrast to that, Wilson et al. (2017) show that, for some instances of this problem class, the adaptive methods converge to solutions that generalize arbitrarily badly to unseen data. The authors construct such problematic instances using the following lemma.
Lemma 2 (Lemma 3.1 in Wilson et al. (2017)).
Suppose $y_i \in \{+1, -1\}$ for $i = 1, \dots, n$, and there exists a scalar $c$ such that $X \operatorname{sign}(X^\mathsf{T} y) = c\, y$. Then, when initialized at $w_0 = 0$, the iterates generated by full-batch adagrad, adam, and rmsprop on the objective (26) satisfy $w_t \propto \operatorname{sign}(X^\mathsf{T} y)$.
Intriguingly, as we show in §B.6 of the supplementary material, this statement easily extends to sign descent, i.e., the method updating $w_{t+1} = w_t - \alpha \operatorname{sign}(\nabla L(w_t))$.
Lemma 3.
Under the assumptions of Lemma 2, the iterates generated by sign descent satisfy $w_t \propto \operatorname{sign}(X^\mathsf{T} y)$.
On the other hand, this does not extend to msvag, an adaptive method by any standard. As noted before, the first step of msvag coincides with a gradient descent step. The iterates generated by msvag will thus not generally be proportional to $\operatorname{sign}(X^\mathsf{T} y)$. While this by no means implies that msvag converges to the max-margin solution or has otherwise favorable generalization properties, the construction of Wilson et al. (2017) does not apply to it.
This suggests that it is the sign that impedes generalization in the examples constructed by Wilson et al. (2017), rather than the elementwise adaptivity as such. Our experiments substantiate this suspicion. The fact that all currently popular adaptive methods are also signbased has led to a conflation of these two aspects. The main motivation for this work was to disentangle them.
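The mechanism behind Lemma 3 can be seen on a tiny hand-built instance; the matrix below is our own deliberately degenerate construction satisfying the lemma's precondition, not the example of Wilson et al. (2017):

```python
# constructed instance: the columns of X are identical, so X sign(X^T y) = 3 y
X = [[1.0, 1.0, 1.0],
     [-1.0, -1.0, -1.0]]
y = [1.0, -1.0]

def sign(z):
    return 1.0 if z >= 0 else -1.0   # sign(0) := 1, as in the paper

def grad(w):
    # gradient of ||X w - y||^2 / 2, i.e., X^T (X w - y)
    r = [sum(Xi[j] * w[j] for j in range(3)) - yi for Xi, yi in zip(X, y)]
    return [sum(X[i][j] * r[i] for i in range(2)) for j in range(3)]

Xty = [sum(X[i][j] * y[i] for i in range(2)) for j in range(3)]
s = [sign(v) for v in Xty]

# the lemma's precondition: X sign(X^T y) = c y, here with c = 3
assert [sum(Xi[j] * s[j] for j in range(3)) for Xi in X] == [3.0 * yi for yi in y]

w, alpha = [0.0, 0.0, 0.0], 0.05
for _ in range(10):
    w = [wi - alpha * sign(gi) for wi, gi in zip(w, grad(w))]
    # every sign descent iterate is a scalar multiple of sign(X^T y)
    assert len({wi / si for wi, si in zip(w, s)}) == 1
```

A plain gradient descent step from $w_0 = 0$ would instead move along $X^\mathsf{T} y$ itself, which is why the construction does not trap non-sign-based methods.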
6 Experiments
We experimentally compare msvag and adam to their non-variance-adapted counterparts msgd and mssd (Alg. 2). Since these are the four possible recombinations of the sign and the variance adaptation (Fig. 1), this comparison allows us to separate the effects of the two aspects.
6.1 Experimental SetUp
We evaluated the four methods on the following problems:

(P1) A vanilla convolutional neural network (CNN) with two convolutional and two fully-connected layers on the Fashion-mnist data set (Xiao et al., 2017).

(P2) A vanilla CNN with three convolutional and three fully-connected layers on cifar10 (Krizhevsky, 2009).

(P3) The wide residual network WRN-40-4 architecture of Zagoruyko & Komodakis (2016) on cifar100.

(P4) A two-layer LSTM (Hochreiter & Schmidhuber, 1997) for character-level language modelling on Tolstoy's War and Peace.
A detailed description of all network architectures has been moved to §A of the supplementary material.
For all experiments, we used a moving average constant of $\beta = 0.9$ for msgd, mssd and msvag, and default parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$) for adam. The global step size $\alpha$ was tuned for each method individually by first finding the maximal stable step size by trial and error, then searching downwards. We selected the one that yielded maximal test accuracy within a fixed number of training steps, a scenario close to an actual application of the methods by a practitioner. (Loss and accuracy were evaluated at a fixed interval on the full test set as well as on an equally sized portion of the training set.) Experiments with the best step size were replicated ten times with different random seeds. While (P1) and (P2) were trained with a constant $\alpha$, we used a decrease schedule for (P3) and (P4), which was fixed in advance for all methods. Full details can be found in §A of the supplementary material.
6.2 Results
Fig. 5 shows results. We make four main observations.
1) The sign aspect dominates
With the exception of (P4), the performance of the four methods distinctly clusters into sign-based and non-sign-based methods. Of the two components of adam identified in §1.1, the sign aspect seems to be by far the dominant one, accounting for most of the difference between adam and msgd. adam and mssd display surprisingly similar performance, an observation that might inform practitioners' choice of algorithm, especially for very high-dimensional problems, where adam's additional memory requirements are an issue.
2) The usefulness of the sign is problemdependent
Considering only training loss, the two sign-based methods clearly outperform the two non-sign-based methods on problems (P1) and (P3). On (P2), adam and mssd make rapid initial progress but later plateau and are undercut by msgd and msvag. On the language modelling task (P4), the non-sign-based methods show superior performance. Relating to our analysis in Section 2, this shows that the usefulness of sign-based methods depends on the particular problem at hand.
3) Variance adaptation helps
In all experiments, the variance-adapted variants perform at least as well as, and often better than, their "base algorithms". The magnitude of the effect varies. For example, adam and mssd have identical performance on (P3), but msvag significantly outperforms msgd on (P3) as well as (P4).
4) Generalization effects are caused by the sign
The cifar100 example (P3) displays effects similar to those reported by Wilson et al. (2017): adam vastly outperforms msgd in training loss, but has significantly worse test performance. Observe that mssd behaves almost identically to adam in both train and test and thus displays the same generalization-harming effects. msvag, on the other hand, improves upon msgd and, in particular, does not display any adverse effects on generalization. This corroborates the suspicion raised in §5 that the generalization-harming effects of adam are caused by the sign aspect rather than by the elementwise adaptive step sizes.
7 Conclusion
We have argued that adam combines two components: taking signs and variance adaptation. Our experiments show that the sign aspect is by far the dominant one, but its usefulness is problem-dependent. Our theoretical analysis suggests that it depends on the interplay of stochasticity, the conditioning of the problem, and its axis-alignment. Sign-based methods also seem to have an adverse effect on the generalization performance of the obtained solution; this is a possible starting point for further research into the generalization effects of optimization algorithms.
The second aspect, variance adaptation, is not restricted to adam but can be applied to any update direction. We have provided a general motivation for variance adaptation factors that is independent of the update direction. In particular, we introduced msvag, a variance-adapted variant of momentum sgd, which is a useful addition to the practitioner's toolbox for problems where sign-based methods like adam fail. A TensorFlow (Abadi et al., 2015) implementation can be found at https://github.com/lballes/msvag.

Acknowledgements
The authors thank Maren Mahsereci for helpful discussions. Lukas Balles kindly acknowledges the support of the International Max Planck Research School for Intelligent Systems (IMPRSIS).
References
 Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 Amari (1998) Amari, S.I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Balles et al. (2017a) Balles, L., Mahsereci, M., and Hennig, P. Automizing stochastic optimization with gradient variance estimates. In Automatic Machine Learning Workshop at ICML 2017, 2017a.

 Balles et al. (2017b) Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), pp. 410–419, 2017b.
 Becker & LeCun (1988) Becker, S. and LeCun, Y. Improving the convergence of backpropagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37, 1988.
 Chaudhari et al. (2017) Chaudhari, P., Choromanska, A., Soatto, S., and LeCun, Y. EntropySGD: Biasing gradient descent into wide valleys. The International Conference on Learning Representations (ICLR), 2017.
 Defazio et al. (2014) Defazio, A., Bach, F., and LacosteJulien, S. SAGA: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pp. 1646–1654, 2014.
 Diaconis & Shahshahani (1987) Diaconis, P. and Shahshahani, M. The subgroup algorithm for generating uniform random variables. Probability in the Engineering and Informational Sciences, 1(01):15–32, 1987.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997.
 Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pp. 315–323, 2013.
 Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximalgradient methods under the PolyakLojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
 Kingma & Ba (2015) Kingma, D. and Ba, J. ADAM: A method for stochastic optimization. The International Conference on Learning Representations (ICLR), 2015.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Mahsereci & Hennig (2015) Mahsereci, M. and Hennig, P. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems 28, pp. 181–189, 2015.
 Martens (2014) Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 Nesterov (1983) Nesterov, Y. A method of solving a convex programming problem with convergence rate $\mathcal{O}(1/k^2)$. In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.
 Polyak (1964) Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

 Riedmiller & Braun (1993) Riedmiller, M. and Braun, H. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pp. 586–591. IEEE, 1993.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
 Schaul et al. (2013) Schaul, T., Zhang, S., and LeCun, Y. No more pesky learning rates. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 343–351, 2013.
 Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. RMSPROP: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, Lecture 6.5, 2012.
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, pp. 4151–4161, 2017.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. FashionMNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12, September 2016.
 Zeiler (2012) Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
—Supplementary Material—
Appendix A Experiments
A.1 Network Architectures
Fashion-mnist
We trained a simple convolutional neural network with two convolutional layers (size 5×5, with 32 and 64 filters, respectively), each followed by max-pooling over 3×3 areas with stride 2, and a fully-connected layer with 1024 units. ReLU activation was used for all layers. The output layer has 10 units with softmax activation. We used cross-entropy loss, without any additional regularization, and a minibatch size of 64. We trained for a total of 6000 steps with a constant global step size $\alpha$.
Cifar10
We trained a CNN with three convolutional layers (64 filters of size 5×5, 96 filters of size 3×3, and 128 filters of size 3×3) interspersed with max-pooling over 3×3 areas with stride 2 and followed by two fully-connected layers with 512 and 256 units. ReLU activation was used for all layers. The output layer has 10 units with softmax activation. We used cross-entropy loss and applied regularization on all weights, but not the biases. During training we performed standard data augmentation operations (random cropping of sub-images, left-right mirroring, color distortion) on the input images. We used a batch size of 128 and trained for a total of 40k steps with a constant global step size $\alpha$.
Cifar100
We use the WRN-40-4 architecture of Zagoruyko & Komodakis (2016); details can be found in the original paper. We used cross-entropy loss and applied regularization on all weights, but not the biases. We used the same data augmentation operations as for cifar10, a batch size of 128, and trained for 80k steps. For the global step size $\alpha$, we used the decrease schedule suggested by Zagoruyko & Komodakis (2016), which amounts to multiplying $\alpha$ with a factor of 0.2 after 24k, 48k, and 64k steps. TensorFlow code was adapted from https://github.com/dalgu90/wrntensorflow.
War and Peace
We preprocessed War and Peace, extracting a vocabulary of 83 characters. The language model is a two-layer LSTM with 128 hidden units each. We used a sequence length of 50 characters and a batch size of 50. Dropout regularization was applied during training. We trained for 200k steps; the global step size $\alpha$ was multiplied with a factor of 0.1 after 125k steps. TensorFlow code was adapted from https://github.com/sherjilozair/charrnntensorflow.
A.2 Step Size Tuning
Step sizes (initial step sizes for the experiments with a step size decrease schedule) for each optimizer were tuned by first finding the maximal stable step size by trial and error and then searching downwards over multiple orders of magnitude, testing several candidate values per order of magnitude. We evaluated loss and accuracy on the full test set (as well as on an equally sized portion of the training set) at a constant interval and selected the best-performing step size for each method in terms of maximally reached test accuracy. Using the best choice, we replicated the experiment ten times with different random seeds, randomizing the parameter initialization, data set shuffling, dropout, et cetera. In some rare cases where the accuracies for two different step sizes were very close, we replicated both and then chose the one with the higher maximum mean accuracy.
The following list shows all explored step sizes, with the “winner” in bold face.
Problem 1: Fashionmnist
msgd:
adam:
mssd:
msvag:
Problem 2: cifar10
msgd:
adam:
mssd:
msvag:
Problem 3: cifar100
msgd:
adam:
mssd:
msvag:
Problem 4: War and Peace
msgd:
adam:
mssd:
msvag:
Appendix B Mathematical Details
B.1 The Sign of a Stochastic Gradient
We have stated in the main text that the sign of a stochastic gradient, $\operatorname{sign}(g_i)$, has success probabilities

(27)  $\rho_i = \mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big] = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{|\nabla L_i|}{\sqrt{2}\,\sigma_i}\right)\right)$
under the assumption that $g_i \sim \mathcal{N}(\nabla L_i, \sigma_i^2)$. The following lemma formally proves this statement and Figure 6 provides a pictorial illustration.
Lemma 4.
If $g_i \sim \mathcal{N}(\nabla L_i, \sigma_i^2)$, then

(28)  $\mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big] = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{|\nabla L_i|}{\sqrt{2}\,\sigma_i}\right)\right).$
Proof.
The cumulative distribution function (cdf) of $g_i$ is $\mathrm{P}[g_i \leq z] = \Phi\big((z - \nabla L_i)/\sigma_i\big)$, where $\Phi(z) = \frac{1}{2}\big(1 + \operatorname{erf}(z/\sqrt{2})\big)$ is the cdf of the standard normal distribution. If $\nabla L_i > 0$, then

(29)  $\mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big] = \mathrm{P}[g_i > 0] = 1 - \Phi\!\left(-\frac{\nabla L_i}{\sigma_i}\right) = \Phi\!\left(\frac{\nabla L_i}{\sigma_i}\right) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{|\nabla L_i|}{\sqrt{2}\,\sigma_i}\right)\right).$

If $\nabla L_i < 0$, then

(30)  $\mathrm{P}\big[\operatorname{sign}(g_i) = \operatorname{sign}(\nabla L_i)\big] = \mathrm{P}[g_i \leq 0] = \Phi\!\left(-\frac{\nabla L_i}{\sigma_i}\right) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(-\frac{\nabla L_i}{\sqrt{2}\,\sigma_i}\right)\right) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{|\nabla L_i|}{\sqrt{2}\,\sigma_i}\right)\right),$

where the last step used the antisymmetry of the error function. ∎
B.2 Analysis on Stochastic QPs
B.2.1 Derivation of $\mathcal{I}(-g)$ and $\mathcal{I}(-\operatorname{sign}(g))$
We derive the expressions in Eqs. (13) and (14), dropping the fixed $\theta$ from the notation for readability.

For sgd, we have $p = -g$ with $\mathbb{E}[\nabla L^\mathsf{T} p]^2 = \lVert \nabla L \rVert^4$ and $\mathbb{E}[g^\mathsf{T} Q g] = \nabla L^\mathsf{T} Q\, \nabla L + \operatorname{tr}(Q \operatorname{cov}[g])$, which is a general fact for quadratic forms of random variables. For the stochastic QP, the gradient covariance is $\operatorname{cov}[g] = \nu^2 Q^2$, thus $\mathbb{E}[g^\mathsf{T} Q g] = \nabla L^\mathsf{T} Q\, \nabla L + \nu^2 \operatorname{tr}(Q^3)$. Plugging everything into Eq. (12) yields

(31)  $\mathcal{I}(-g) = \frac{\lVert \nabla L \rVert^4}{2\left(\nabla L^\mathsf{T} Q\, \nabla L + \nu^2 \operatorname{tr}(Q^3)\right)}.$

For stochastic sign descent, $p = -\operatorname{sign}(g)$, we have $\mathbb{E}[\operatorname{sign}(g_i)] = (2\rho_i - 1)\operatorname{sign}(\nabla L_i)$ and thus $\mathbb{E}[\nabla L^\mathsf{T} p]^2 = \big(\sum_i (2\rho_i - 1)|\nabla L_i|\big)^2$. Regarding the denominator, it is

(32)  $\mathbb{E}\big[\operatorname{sign}(g)^\mathsf{T} Q \operatorname{sign}(g)\big] = \sum_{i,j} q_{ij}\, \mathbb{E}[\operatorname{sign}(g_i) \operatorname{sign}(g_j)] \leq \sum_{i,j} |q_{ij}|,$

since $|\mathbb{E}[\operatorname{sign}(g_i)\operatorname{sign}(g_j)]| \leq 1$. Further, by definition of $p_{\mathrm{diag}}$, we have $\sum_{i,j} |q_{ij}| = \sum_i q_{ii} / p_{\mathrm{diag}}$. Since $Q$ is positive definite, its diagonal elements are positive, such that $\sum_i q_{ii} = \operatorname{tr}(Q) = \sum_i \lambda_i$. Plugging everything into Eq. (12) yields

(33)  $\mathcal{I}(-\operatorname{sign}(g)) \geq \frac{p_{\mathrm{diag}} \left( \sum_i (2\rho_i - 1)|\nabla L_i| \right)^2}{2 \sum_i \lambda_i}.$
B.2.2 Properties of $p_{\mathrm{diag}}$
By writing $Q = \sum_k \lambda_k u_k u_k^\mathsf{T}$ in its eigendecomposition with orthonormal eigenvectors $u_k$, we find

(34)  $\sum_{i,j} |q_{ij}| = \sum_{i,j} \Big| \sum_k \lambda_k u_{k,i} u_{k,j} \Big| \leq \sum_k \lambda_k \lVert u_k \rVert_1^2.$

As mentioned before, $\lVert u_k \rVert_1 \leq \sqrt{d}\, \lVert u_k \rVert_2 = \sqrt{d}$. Hence, $p_{\mathrm{diag}} = \sum_i q_{ii} / \sum_{i,j} |q_{ij}| \geq \operatorname{tr}(Q) / \big(d \sum_k \lambda_k\big) = 1/d$.