Memory-Efficient Adaptive Optimization for Large-Scale Learning

by Rohan Anil, et al.

Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in modern machine learning. These methods maintain second-order statistics of each parameter, thus doubling the memory footprint of the optimizer. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training very large deep models.





1 Introduction

Adaptive gradient-based optimizers such as AdaGrad [9] and Adam [14] are among the de facto methods of choice in modern machine learning. These methods adaptively tune the learning rate for each parameter during the optimization process using cumulative second-order statistics of the parameter. Often offering superior convergence properties, these methods are very attractive in large scale applications due to their moderate time and space requirements, which are linear in the number of parameters.

However, in extremely large scale applications even the modest memory overhead imposes grave limitations on the quality of the trained model. For example, recent advances in machine translation hinge on inflating the number of parameters in the trained language model to hundreds of millions. In such applications, the memory overhead of the optimizer severely restricts the size of the model that can be used as well as the number of examples in each mini-batch, both of which have been shown to have a dramatic effect on the accuracy of the model.

Motivated by these challenges, we describe an adaptive optimization method that retains the benefits of standard per-parameter adaptivity while significantly reducing its memory costs. Our construction is general and flexible, yet remarkably simple and almost trivial to implement. We give simple convergence guarantees in the convex (stochastic and online) optimization setting, which show our method to be most effective when the gradients have a natural activation pattern, namely, the parameters can be subdivided into (not necessarily disjoint) sets such that the gradient entries within each set are correlated with each other and tend to share a similar order of magnitude. For example, in deep networks the incoming or outgoing edges of a neuron are jointly activated and, loosely speaking, their associated gradients exhibit similar statistical characteristics. That said, we do not assume that the activation pattern is fully prescribed to the optimization algorithm before its run.

Large scale experiments show that our algorithm achieves comparable, and at times superior, rates of convergence to those obtained by standard, linear-space adaptive methods using the same batch size. Focusing primarily on language modeling tasks that are notorious for their huge models, we further demonstrate that the reduction in memory footprint can be utilized for a substantial increase in the batch size, which greatly speeds up convergence. As a byproduct of the diminished memory costs, our method also exhibits improved (wall-clock) runtime, which could be attributed to the reduced frequency of memory access.

1.1 Related work

Adaptive learning rates in online and stochastic optimization date back at least to [3] and were popularized in [9, 15], the former of which introduced the well-known AdaGrad algorithm. Several variants of AdaGrad have now been proposed in the optimization and machine learning literature (see [17] and the references therein), the most notable of which is the Adam algorithm [14]. All of these methods require (at least) linear space for maintaining various per-parameter statistics along their execution.

One notable exception, which is directly related to our work, is the Adafactor algorithm [21] that was proposed as a way to reduce the memory costs of Adam, primarily for training large language models. While the memory requirements of our construction are similar to Adafactor’s, the applicability as well as the convergence properties of the two algorithms are quite different. We discuss the connections and disparities in more detail in Section 3 and show an empirical comparison of the algorithms in Section 5.

Another closely related method is the Shampoo [10] algorithm for optimization over tensor structures. The goal of Shampoo is very different from ours, and perhaps more ambitious: going beyond entry-wise learning rates and employing full-matrix regularization in a computationally efficient way. Nonetheless, Shampoo can also be seen as a method to substantially reduce the memory footprint of full-matrix preconditioned algorithms (specifically, full-matrix AdaGrad). In a sense, our algorithms are analogous to a diagonalized version of the Shampoo algorithm.

Yet another recent adaptive optimization method is the GGT algorithm [2]. Similarly to Shampoo, the goal of GGT is to reduce the computation cost of full-matrix preconditioning in order to make it practical in large scale settings. However, GGT stores multiple copies of the gradient over the course of its execution, and as a result, the space requirements of GGT are far from being sublinear in the size of the model.

2 Preliminaries

We begin by establishing some basic notation. For a vector $g \in \mathbb{R}^d$ and $\alpha > 0$, we use the notation $g^\alpha$ to refer to the vector obtained by raising each of the entries of $g$ to the power $\alpha$. We also use $\mathrm{diag}(g)$ to denote the square matrix whose diagonal elements are the entries of $g$ (and whose off-diagonal entries are zeros). We use $[d]$ to denote the set $\{1, \ldots, d\}$. Finally, $\mathbf{1}_d$ is the $d$-dimensional vector whose entries are all $1$.

2.1 Optimization setup

We henceforth assume the general online optimization setting (see [20, 11]); for our analysis, we will assume the online convex optimization setup, in which the loss functions are convex. Optimization takes place in rounds $t = 1, \ldots, T$, where in each round the algorithm has to choose a parameter vector $w_t \in \mathbb{R}^d$. After making the choice on round $t$, the algorithm receives a loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ which is used to perform an update to the parameters; often, and as will be the case in this paper, this update is determined by the gradient $g_t = \nabla \ell_t(w_t)$ of the instantaneous loss $\ell_t$ at the current iterate $w_t$. The algorithm is measured by its $T$-round regret, defined as the quantity $\sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)$; an algorithm is convergent if its regret is $o(T)$, i.e., if its average regret approaches zero as the number of rounds $T$ grows.

The above setup includes stochastic (possibly mini-batched) optimization as a special case. In the latter, one desires to minimize a population loss $L(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ based on samples of $z$, where $\ell(w, z)$ defines the loss of parameters $w$ on a batch $z$. The online loss function $\ell_t(w) = \ell(w, z_t)$ is then the average loss over a mini-batch $z_t$ received on iteration $t$, and the stochastic gradient $g_t$ is a conditionally unbiased estimate of the gradient of $L$ at the current parameter vector $w_t$. Under convexity assumptions, an online algorithm with vanishing average regret can be converted to a stochastic optimization algorithm for minimizing the population loss $L$ [4].

2.2 Adaptive methods

For the sake of self-containment, we give a brief description of the AdaGrad algorithm [9]. AdaGrad maintains at every step $t$ the following parameter-wise accumulated statistics, computed based on the previously obtained gradients $g_1, \ldots, g_t$:

$$\gamma_t(i) = \sum_{s=1}^{t} g_s^2(i), \qquad \forall\, i \in [d].$$

Relying on these statistics, the update rule of the algorithm on step $t$ takes the form:

$$w_{t+1}(i) = w_t(i) - \eta\, \frac{g_t(i)}{\sqrt{\gamma_t(i)}}, \qquad \forall\, i \in [d],$$

where $\eta > 0$ is an external learning rate parameter. AdaGrad has been shown to be particularly effective in training sparse models, where the effective learning rates decay in a moderate way for rare (yet possibly informative) features. In these cases, AdaGrad can potentially lead to huge gains in terms of convergence; see the discussion in [9].
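As a concrete illustration of the AdaGrad update described above, the following is a minimal pure-Python sketch. The function name `adagrad_step`, the list-based representation, and the explicit 0/0 = 0 convention are assumptions made for this example, not part of the original presentation.

```python
import math

def adagrad_step(w, g, gamma, lr):
    """One AdaGrad step: updates the accumulator `gamma` in place, returns new weights."""
    new_w = []
    for i in range(len(w)):
        gamma[i] += g[i] ** 2  # gamma_t(i) = sum over s <= t of g_s(i)^2
        # Convention 0/0 = 0: a coordinate that never saw a gradient is left unchanged.
        step = g[i] / math.sqrt(gamma[i]) if gamma[i] > 0 else 0.0
        new_w.append(w[i] - lr * step)
    return new_w

# Illustrative values only.
w, gamma = [0.0, 0.0], [0.0, 0.0]
w = adagrad_step(w, [3.0, 0.0], gamma, lr=0.1)
```

Note that the accumulator `gamma` has the same length as `w`; this is the linear memory overhead that the rest of the paper aims to reduce.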

2.3 Activation patterns and covers

While the theoretical analysis of AdaGrad and related algorithms does not make any assumptions on gradient values, in practice we often observe that certain entries of a gradient have similar values, and exhibit what we call an activation pattern. For example, in embedding layers of deep networks, an entire column is either zero or non-zero. Similarly, in layers with ReLU activations it is often observed that all gradients corresponding to the same unit are jointly either zero or non-zero, and in the latter case, their absolute values share a similar order of magnitude.

In both examples, for each parameter $i \in [d]$ there is a certain set of indices $S_i$ such that for all gradients $g_t$ we expect that $g_t(j) \approx g_t(i)$ for all $j \in S_i$. We do not attempt to formalize this notion further, and the analysis of our algorithm does not rely on a definition of an activation pattern. Rather, we leave it as an intuitive concept that serves as a motivation for our use of a cover.


A cover of a set of parameters $[d]$ is a collection of $k$ nonempty sets $\{S_r\}_{r=1}^{k}$, such that $S_r \subseteq [d]$ and $\bigcup_{r} S_r = [d]$. In particular, each index $i \in [d]$ may be contained in multiple sets $S_r$. We call $k$ the size of the cover.

Specific covers of interest include:

  (i) Singletons: $S_i = \{i\}$ for all $i \in [d]$; this is a degenerate case which does not model any correlations between parameters.

  (ii) Matrix rows/columns: parameters are organized as an $m \times n$ matrix, and each $S_r$ is the set of indices corresponding to a row/column of this matrix.

  (iii) Tensor slices: parameters are organized as a tensor of dimension $n_1 \times \cdots \times n_p$, and each $S_r$ is a $(p-1)$-dimensional slice of the tensor.

  (iv) Multiple tensors: parameters are organized in multiple tensors, each of which has its own cover. The cover is then the union of all the individual covers.

Our algorithm is provided with a prescribed cover as input, and its convergence is characterized in terms of the cover. We further argue, though only informally, that when a cover is “consistent” with the natural activation pattern of the parameters, we can expect the convergence of our algorithm to be significantly better.
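To make the notion of a cover concrete, here is a small sketch that builds the row/column cover of a matrix and checks the two defining properties (every set is a nonempty subset of the index set, and the union is the whole index set). The helper names `row_column_cover` and `is_cover`, the row-major flattening, and the 0-based indices are assumptions of this example; the text itself uses 1-based indices.

```python
def row_column_cover(m, n):
    """Cover of the flattened (row-major) indices of an m x n matrix by rows and columns."""
    rows = [[r * n + c for c in range(n)] for r in range(m)]
    cols = [[r * n + c for r in range(m)] for c in range(n)]
    return rows + cols

def is_cover(sets, d):
    """True if every set is a nonempty subset of range(d) and the union is range(d)."""
    union = set()
    for s in sets:
        if not s or any(i < 0 or i >= d for i in s):
            return False
        union.update(s)
    return union == set(range(d))

cover = row_column_cover(2, 3)  # k = m + n = 5 sets over d = 6 parameters
```

In this cover every parameter belongs to exactly two sets, its row and its column, which is the overlapping situation the algorithm is designed to exploit.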

3 The SM3 algorithm

The idea behind our algorithm is to keep a single variable for each set $S_r$ in the cover. Thus, the additional space it requires is $O(k)$ rather than $O(d)$; typically $k$ is substantially smaller than $d$, which yields tangible savings in memory. Concretely, for each set $S_r$, the algorithm maintains a running sum, $\mu_t(r)$, of the maximal variance over all gradient entries $j \in S_r$. Next, for each parameter $i$, we take the minimum over all variables $\mu_t(r)$ associated with sets which cover $i$, denoted $S_r \ni i$. Thereafter, the learning rate corresponding to the $i$'th gradient entry is determined by taking the square-root of this minimum, denoted by $\nu_t(i)$. Accordingly, we name our algorithm the Square-root of Minima of Sums of Maxima of Squared-gradients Method, or in short, SM3. See Algorithm 1 for its pseudocode.

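Algorithm 1: SM3-I
1: parameters: learning rate $\eta$, cover $\{S_r\}_{r=1}^{k}$
2: initialize $w_1 = 0$; $\mu_0(r) = 0$ for all $r \in [k]$
3: for $t = 1, \ldots, T$ do
4:     receive gradient $g_t = \nabla \ell_t(w_t)$
5:     for $r = 1, \ldots, k$ do
6:         set $\mu_t(r) \leftarrow \mu_{t-1}(r) + \max_{j \in S_r} g_t^2(j)$
7:     for $i = 1, \ldots, d$ do
8:         set $\nu_t(i) \leftarrow \min_{r : S_r \ni i} \mu_t(r)$
9:         update $w_{t+1}(i) \leftarrow w_t(i) - \eta\, g_t(i) / \sqrt{\nu_t(i)}$ (with the convention that $0/0 = 0$)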
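The SM3-I pseudocode above can be sketched in a few lines of pure Python. This is an illustrative rendering for an arbitrary cover given as a list of index lists, not the authors' implementation; the function name `sm3_i_step` and the 0-based indexing are assumptions of the example.

```python
import math

def sm3_i_step(w, g, mu, cover, lr):
    """One SM3-I step. `mu` holds one accumulator per cover set and is updated in place."""
    # Add the largest squared gradient inside each cover set to that set's accumulator.
    for r, s in enumerate(cover):
        mu[r] += max(g[j] ** 2 for j in s)
    # nu(i) is the smallest accumulator among the sets that contain index i.
    nu = [math.inf] * len(w)
    for r, s in enumerate(cover):
        for i in s:
            nu[i] = min(nu[i], mu[r])
    # Scaled gradient step, using the convention 0/0 = 0.
    new_w = [w[i] - lr * g[i] / math.sqrt(nu[i]) if nu[i] > 0 else w[i]
             for i in range(len(w))]
    return new_w, nu

# Rows and columns of a 2 x 2 matrix stored in row-major order (illustrative values).
cover = [[0, 1], [2, 3], [0, 2], [1, 3]]
mu = [0.0] * len(cover)
w, nu = sm3_i_step([0.0] * 4, [1.0, 2.0, 0.0, 0.0], mu, cover, lr=1.0)
```

In this toy run the only state carried between steps is `mu`, with one entry per cover set. The vector `nu` is recomputed from `mu` on every step and does not need to be stored.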


In case (i) above, where there is a set $S_i = \{i\}$ for each $i \in [d]$, the algorithm reduces to the AdaGrad algorithm [9]. The more interesting cases are where $k \ll d$ and each index $i \in [d]$ is covered by multiple sets. In such settings, the memory overhead of the algorithm is sublinear in $d$. In particular, in setting (ii) the memory footprint reduces from $O(mn)$ to $O(m+n)$, which can be quite substantial in large scale. In setting (iii) the improvement is more pronounced, as the space requirement drops from $O(\prod_{i=1}^{p} n_i)$ to $O(\sum_{i=1}^{p} n_i)$.
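The size of this saving is easy to check with a few lines of arithmetic. The sketch below counts optimizer accumulators for a per-parameter method versus a slice cover; the helper name `accumulator_counts` and the example shapes are illustrative assumptions (the first shape reuses the 1024 model and 8192 hidden dimensions quoted for Transformer-Big in Section 5.1).

```python
from math import prod

def accumulator_counts(shape):
    """Return (per-parameter accumulators, slice-cover accumulators) for a tensor shape."""
    return prod(shape), sum(shape)

# A 1024 x 8192 weight matrix: one accumulator per entry versus one per row and column.
full, sm3 = accumulator_counts((1024, 8192))
```

For this shape the count drops from 8,388,608 to 9,216 accumulators. These are counts of optimizer statistics only; the parameters and gradients themselves are still stored in full.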

The time per iteration of Algorithm 1 is $O(\sum_{r=1}^{k} |S_r|)$. To see this, consider a bipartite graph defined over $d + k$ vertices. Nodes on one side of the graph correspond to indices $i \in [d]$, while nodes on the other side correspond to indices $r \in [k]$. The edges of the graph are all pairs $(i, r)$ such that $i \in S_r$. The complexity of each of the inner for-loops of the algorithm scales with the number of edges in this graph, which is equal to $\sum_{r=1}^{k} |S_r|$. (Applying the update to the weights takes $O(d)$ time, but this is always dominated by the former quantity.)

As a final remark, notice that the update rule of Algorithm 1 seems to involve a division by zero when $\nu_t(i) = 0$. However, whenever $\nu_t(i) = 0$ then necessarily also $g_t(i) = 0$. (This is a direct consequence of Claim 1 below.) In other words, whenever the denominator in the update rule is zero, the corresponding entry has zero gradient and thus need not be updated.

3.1 Analysis

We now prove convergence guarantees for Algorithm 1. We first show two elementary properties of the step sizes the algorithm computes.

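Claim 1.

For any $i \in [d]$, the sequence $\nu_1(i), \nu_2(i), \ldots$ is monotonically increasing and, for all $t \ge 1$,

$$\sum_{s=1}^{t} g_s^2(i) \le \nu_t(i).$$

Proof. The monotonicity is immediate as for any $r \in [k]$ the variable $\mu_t(r)$ is increasing in $t$ by definition, thus $\nu_t(i) = \min_{r : S_r \ni i} \mu_t(r)$ is also increasing for all $i \in [d]$.

Next, since $g_s^2(i) \le \max_{j \in S} g_s^2(j)$ for any set $S$ that contains $i$, we have

$$\sum_{s=1}^{t} g_s^2(i) \le \min_{r : S_r \ni i} \sum_{s=1}^{t} \max_{j \in S_r} g_s^2(j) = \min_{r : S_r \ni i} \mu_t(r).$$

The claim now follows since $\min_{r : S_r \ni i} \mu_t(r) = \nu_t(i)$. ∎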

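Proposition 2.

Assume that the loss functions $\ell_1, \ell_2, \ldots$ are convex, and let $w_1, w_2, \ldots$ be the iterates generated by Algorithm 1. Then, for any $w^\star \in \mathbb{R}^d$,

$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \le 2D \sum_{i=1}^{d} \sqrt{ \min_{r : S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) },$$

where $\max_t \|w_t - w^\star\|_\infty \le D$ and choosing $\eta = D$.

In particular, if the functions $\ell_t$ are stochastic samples with $\mathbb{E}[\ell_t] = L$, e.g., each is the loss function over a batch of i.i.d. examples, then the above bound translates using standard arguments to a convergence guarantee for the average iterate $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$ of the form

$$\mathbb{E}[L(\bar{w}_T)] - L(w^\star) = O\Bigg( \frac{1}{T} \sum_{i=1}^{d} \mathbb{E} \sqrt{ \min_{r : S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) } \Bigg).$$

In the above proposition we implicitly assume that the iterates of Algorithm 1 remain bounded and $D$ is a constant. This can be enforced by projecting the iterates to a bounded set of choice. We avoid introducing projections explicitly as they are rarely used in practice.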

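Proof of Proposition 2.

Let us first assume that $g_1(i) \neq 0$ for all $i \in [d]$, so that $\nu_t(i) > 0$ for all $i \in [d]$ and $t \ge 1$ due to Claim 1. The starting point of the analysis is the simple observation that Algorithm 1 performs Online Mirror Descent updates, where the step on round $t$ uses the positive definite diagonal matrix $H_t = \mathrm{diag}(\nu_t^{1/2})$ for regularization. Then, employing a standard regret bound for the Online Mirror Descent algorithm with time-dependent regularization (see for instance [9, Proposition 3]), the regret of the algorithm is bounded by

$$\frac{1}{2\eta} \sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_{t+1} - w^\star\|_{H_t}^2 \Big) + \frac{\eta}{2} \sum_{t=1}^{T} \big( \|g_t\|_{H_t}^{*} \big)^2 .$$

Here, $\|x\|_H = \sqrt{x^{\mathsf{T}} H x}$ and $\|\cdot\|^{*}$ is the corresponding dual norm, $\|x\|_H^{*} = \sqrt{x^{\mathsf{T}} H^{-1} x}$.

Henceforth, for notational convenience we set $\nu_0 = 0$. Simplifying the first sum above using the fact that $H_t$ are diagonal matrices, we have

$$\sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_{t+1} - w^\star\|_{H_t}^2 \Big) \le \sum_{t=1}^{T} \sum_{i=1}^{d} \big( \nu_t^{1/2}(i) - \nu_{t-1}^{1/2}(i) \big) \big( w_t(i) - w^\star(i) \big)^2 \le D^2 \sum_{i=1}^{d} \nu_T^{1/2}(i) = D^2\, \mathrm{Tr}(H_T).$$

Now, let $\gamma_t = \sum_{s=1}^{t} g_s^2$ and consider the positive definite diagonal matrix $G_t = \mathrm{diag}(\gamma_t^{1/2})$. From [10, Lemma 2] with $\Phi(G) = \mathrm{Tr}(G)$, we have

$$\sum_{t=1}^{T} \big( \|g_t\|_{G_t}^{*} \big)^2 \le \sum_{t=1}^{T} \big( \|g_t\|_{G_T}^{*} \big)^2 + \mathrm{Tr}(G_T) = 2\, \mathrm{Tr}(G_T).$$

Also, from Claim 1 we know that for all $t$, $H_t \succeq G_t$, thus

$$\sum_{t=1}^{T} \big( \|g_t\|_{H_t}^{*} \big)^2 \le \sum_{t=1}^{T} \big( \|g_t\|_{G_t}^{*} \big)^2 \le 2\, \mathrm{Tr}(G_T) \le 2\, \mathrm{Tr}(H_T).$$

In summary, we have established that

$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \le \Big( \frac{D^2}{2\eta} + \eta \Big) \mathrm{Tr}(H_T).$$

Plugging in $\eta = D$ and the expression for the diagonal elements of $H_T$, we obtain the claim.

For the degenerate case where the matrices $H_t$ may not be strictly positive definite, a careful yet technical inspection of the proof above reveals that our arguments apply to this case as well by replacing inverses with pseudo-inverses. The rest of the proof remains intact as the algorithm does not update parameter $i$ on step $t$ if the corresponding diagonal entry in $H_t$ is zero. ∎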

3.2 Discussion

Notice that adding more sets to the cover used by SM3 improves its convergence bound, but results in a worse space complexity and a higher runtime per step. Therefore, it makes sense in practice to include in the cover only the sets for which we can quickly compute the max and min operations as required by the algorithm. We discuss this point from a practical perspective in Section 4.

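As we mentioned above, when $k = d$ and $S_i = \{i\}$ for all $i \in [d]$, Algorithm 1 reduces to the AdaGrad algorithm. The regret bound in Proposition 2 then precisely recovers the bound attained by AdaGrad (see [9, Eq. 6]),

$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) = O\Bigg( D \sum_{i=1}^{d} \sqrt{ \sum_{t=1}^{T} g_t^2(i) } \Bigg).$$

In the general case, we have

$$\sum_{i=1}^{d} \sqrt{ \min_{r : S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) } \ge \sum_{i=1}^{d} \sqrt{ \sum_{t=1}^{T} g_t^2(i) },$$

as follows from Claim 1. Thus, as can be expected from a space-restricted scheme, our bound is never superior to AdaGrad’s regret bound.

Nevertheless, the two bounds above are of similar order of magnitude when the cover is consistent with the activation pattern of the gradients $g_1, \ldots, g_T$. Indeed, if for any entry $i$ there is a set $S_r$ that covers $i$ such that $g_t(j) \approx g_t(i)$ for all $j \in S_r$, then $\max_{j \in S_r} g_t^2(j) \approx g_t^2(i)$, and thus $\min_{r : S_r \ni i} \sum_{t} \max_{j \in S_r} g_t^2(j) \approx \sum_{t} g_t^2(i)$.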

Therefore, in these scenarios we inherit the convergence properties of AdaGrad while using sublinear memory. In particular, if in addition the gradients are sparse, we can obtain an improved dependence on the dimension as discussed in Duchi et al. [9].

It is also worthwhile to compare our algorithm to Adafactor [21]. The two algorithms differ in a number of important ways. First, Adafactor is only defined for matrix-shaped parameter sets while SM3 applies to tensors of arbitrary dimensions, and even more generally, to any predefined cover of the parameters. Second, Adafactor is essentially a fixed step-size algorithm and often requires an external step-size decay schedule for ensuring convergence. SM3 in contrast decays its learning rates automatically, similarly to AdaGrad. Finally, SM3 has the benefit of entertaining rigorous, albeit elementary, convergence guarantees in the convex case.

3.3 SM3-II

We now discuss a slightly more efficient variant of SM3, which we describe in Algorithm 2. It is very similar to Algorithm 1, and improves on the latter in the following sense.

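Algorithm 2: SM3-II
1: parameters: learning rate $\eta$, cover $\{S_r\}_{r=1}^{k}$
2: initialize $w_1 = 0$; $\mu'_0(r) = 0$ for all $r \in [k]$
3: for $t = 1, \ldots, T$ do
4:     receive gradient $g_t = \nabla \ell_t(w_t)$
5:     initialize $\mu'_t(r) = 0$ for all $r \in [k]$
6:     for $i = 1, \ldots, d$ do
7:         set $\nu'_t(i) \leftarrow \min_{r : S_r \ni i} \mu'_{t-1}(r) + g_t^2(i)$
8:         update $w_{t+1}(i) \leftarrow w_t(i) - \eta\, g_t(i) / \sqrt{\nu'_t(i)}$ (with the convention that $0/0 = 0$)
9:         set $\mu'_t(r) \leftarrow \max\{\mu'_t(r), \nu'_t(i)\}$ for all $r : S_r \ni i$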
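A pure-Python sketch of the SM3-II step follows, again for an arbitrary cover given as a list of index lists. The function name `sm3_ii_step`, the `member` lookup table, and the toy gradients are assumptions of this example rather than details of the authors' code.

```python
import math

def sm3_ii_step(w, g, mu_prev, cover, lr):
    """One SM3-II step; returns (new weights, nu', new per-set accumulators)."""
    d, k = len(w), len(cover)
    member = [[] for _ in range(d)]  # member[i] lists the cover sets that contain i
    for r, s in enumerate(cover):
        for i in s:
            member[i].append(r)
    mu, nu, new_w = [0.0] * k, [0.0] * d, list(w)
    for i in range(d):
        # Tightest previous accumulator covering i, plus the new squared gradient.
        nu[i] = min(mu_prev[r] for r in member[i]) + g[i] ** 2
        if nu[i] > 0:  # convention 0/0 = 0
            new_w[i] = w[i] - lr * g[i] / math.sqrt(nu[i])
        for r in member[i]:
            mu[r] = max(mu[r], nu[i])
    return new_w, nu, mu

# Rows and columns of a 2 x 2 matrix, two illustrative gradient steps.
cover = [[0, 1], [2, 3], [0, 2], [1, 3]]
w, mu = [0.0] * 4, [0.0] * 4
for g in ([1.0, 2.0, 0.0, 0.0], [0.0, 1.0, 3.0, 0.0]):
    w, nu, mu = sm3_ii_step(w, g, mu, cover, lr=1.0)
```

On these two gradients the final `nu` is `[1.0, 5.0, 9.0, 0.0]`, which coincides with the exact per-coordinate sums of squared gradients. Running SM3-I on the same inputs gives `[5.0, 5.0, 9.0, 5.0]`, so the SM3-II statistics are never larger here.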


Proposition 3.

For any $i \in [d]$, the sequence $\nu'_1(i), \nu'_2(i), \ldots$ is monotonically increasing. Further, fixing a sequence of gradient vectors $g_1, g_2, \ldots$, we have for all $t$ and $i$ that

$$\sum_{s=1}^{t} g_s^2(i) \le \nu'_t(i) \le \nu_t(i),$$

where $\nu_1(i), \nu_2(i), \ldots$ is the sequence produced by Algorithm 1 upon receiving the gradient vectors $g_1, g_2, \ldots$.

In other words, Algorithm 2 provides a tighter upper bound on the cumulative gradient squares than Algorithm 1. Consequently, we can show, along similar lines to the proof of Proposition 2, a slightly better convergence bound for Algorithm 2 that scales with the quantity $\sum_{i=1}^{d} \sqrt{\nu'_T(i)}$, which is always smaller than the one appearing in the bound of Algorithm 1.

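Proof of Proposition 3.

First, to establish monotonicity note that the algorithm maintains $\mu'_t(r) = \max_{j \in S_r} \nu'_t(j)$ for $t \ge 1$ and $r \in [k]$. Hence, for $t \ge 1$ and $i \in [d]$ we have

$$\nu'_{t+1}(i) = \min_{r : S_r \ni i} \max_{j \in S_r} \nu'_t(j) + g_{t+1}^2(i) \ge \nu'_t(i).$$

Let $\gamma_t(i) = \sum_{s=1}^{t} g_s^2(i)$. We next prove by induction that $\gamma_t(i) \le \nu'_t(i) \le \nu_t(i)$ for all $t$ and $i \in [d]$. For $t = 1$ this is true as $\nu'_1(i) = \gamma_1(i) = g_1^2(i) \le \nu_1(i)$ for all $i$ by Claim 1. For the induction step, assume that $\gamma_t(i) \le \nu'_t(i) \le \nu_t(i)$ for all $i \in [d]$ and write

$$\nu'_{t+1}(i) = \min_{r : S_r \ni i} \max_{j \in S_r} \nu'_t(j) + g_{t+1}^2(i) \ge \nu'_t(i) + g_{t+1}^2(i) \ge \gamma_t(i) + g_{t+1}^2(i) = \gamma_{t+1}(i).$$

On the other hand, we have

$$\nu'_{t+1}(i) \le \min_{r : S_r \ni i} \Big( \max_{j \in S_r} \nu_t(j) + \max_{j \in S_r} g_{t+1}^2(j) \Big) \le \min_{r : S_r \ni i} \Big( \mu_t(r) + \max_{j \in S_r} g_{t+1}^2(j) \Big) = \min_{r : S_r \ni i} \mu_{t+1}(r) = \nu_{t+1}(i),$$

where the final inequality follows from the fact that, for all $j \in S_r$ one has

$$\nu_t(j) = \min_{r' : S_{r'} \ni j} \mu_t(r') \le \mu_t(r). \qquad ∎$$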

4 Implementation details

We implemented SM3 as an optimizer in TensorFlow [1]. Our implementation follows the pseudocode of Algorithm 2, as it performed slightly yet consistently better than Algorithm 1 in our experiments (as predicted by our bounds). The implementation of the Algorithm 2 optimizer will be released very soon as open source code.

Default covers.

Our implementation employs covers induced by rows and columns of matrices, and more generally, by slices of higher-order tensors (e.g., in convolutional layers). These covers allow us to exploit highly efficient tensor operations provided by GPUs and TPUs for computing max and min over the sets.
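For the row/column cover of a single matrix, the SM3-II statistics reduce to one accumulator per row and one per column. The sketch below uses nested Python lists to stay self-contained; the function name `sm3_ii_matrix_accumulators` is hypothetical, and in an array library the same computation would presumably be a broadcasted elementwise minimum followed by axis-wise maximum reductions.

```python
def sm3_ii_matrix_accumulators(grad, row_acc, col_acc):
    """Return (nu, new_row_acc, new_col_acc) for an m x n gradient given as nested lists."""
    m, n = len(grad), len(grad[0])
    # nu[i][j] = min(row accumulator, column accumulator) + squared gradient entry.
    nu = [[min(row_acc[i], col_acc[j]) + grad[i][j] ** 2 for j in range(n)]
          for i in range(m)]
    new_row = [max(nu[i]) for i in range(m)]                       # max over each row
    new_col = [max(nu[i][j] for i in range(m)) for j in range(n)]  # max over each column
    return nu, new_row, new_col

# Illustrative 2 x 2 gradient with zero-initialized accumulators.
nu, rows, cols = sm3_ii_matrix_accumulators(
    [[1.0, 2.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0])
```

Only `rows` and `cols` need to persist between steps, which is where the m + n memory cost comes from; `nu` is a temporary of the same shape as the gradient.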


Our optimizer can be used in conjunction with momentum for improved performance. We found that momentum, set at 0.9, adds stability and allows use of larger learning rates for all optimizers that we compared.

Hyperparameters and learning-rate.

An important feature of SM3, compared to other widespread optimizers, is that it only has a single hyper-parameter that requires tuning, the learning rate $\eta$. Concretely, SM3 does not rely on a learning-rate decay schedule that is often difficult to tune. The experiments reported in Table 1 of Section 5 verify this empirically. This aspect of SM3 makes it particularly appealing for training large scale models where the training time is too long to allow for exhaustive hyperparameter tuning.

Learning-rate ramp up.

Having said the above, we do often find in deep learning tasks that a high learning rate setting in the early stages of optimization causes instability and might result in failure to converge. Therefore, while SM3 does not require an external learning rate decay schedule, it is often helpful to gradually increase the parameter $\eta$ from zero to its maximal value, typically over the course of the first few thousand updates. While we used this ad hoc safeguard in our experiments, we plan to replace it in the future with norm constraints on the cover sets.
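A ramp-up of this kind can be sketched as follows. The text only says that the learning rate is increased gradually from zero, so the linear shape, the function name `warmup_learning_rate`, and the example values are assumptions of this illustration.

```python
def warmup_learning_rate(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it constant."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)

# Illustrative values: ramp to 0.1 over the first 1000 updates.
lr_start = warmup_learning_rate(0, 0.1, 1000)
lr_mid = warmup_learning_rate(500, 0.1, 1000)
lr_late = warmup_learning_rate(2000, 0.1, 1000)
```

After the ramp the value stays at `base_lr`, matching the statement that no decay schedule is applied afterwards.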

5 Experiments

We demonstrate the practical benefits of SM3 on several machine learning tasks using the published state-of-the-art architectures and algorithms as baselines. We performed experiments on the following three tasks:

  1. Machine translation on two standard datasets from WMT’14: English to French (en→fr) with 36.3M sentence pairs and English to German (en→de) with 4.5M sentence pairs.

  2. Language modeling using Bidirectional Encoder Representations from Transformers (BERT) [8] on the concatenation of Wikipedia and BooksCorpus [25] with 2.5B and 800M words respectively.

  3. Image classification with the ImageNet dataset [18] for which there are a slew of empirical studies [7].

Experiment      Optimizer   Decay Rule
WMT’14 en→de    Adafactor
WMT’14 en→fr    Adafactor
BERT-Large      Adam
AmoebaNet-D     RMSProp
                SM3         None
Table 1: Learning rate decay schedules used by the algorithms we experimented with. Here, is the current time step, is the base learning rate, is a decay constant, is the staircase step interval, is the minimum learning rate for staircase schedule and is a large constant defining the total number of training steps.

5.1 Machine translation

We first ran our experiments using the Transformer model [23] on the smaller WMT’14 en→de dataset. We trained models using the Lingvo [22] sequence modeling framework, available in TensorFlow. We compared SM3 with Adafactor, which has similar space requirements. Results are provided in Figure 1 and Table 2. SM3 performed slightly better than Adafactor in both test perplexity and BLEU score of the trained models.

We then moved on to the larger WMT’14 en→fr dataset using the larger Transformer-Big architecture from [5]. Our results are shown in Figure 2 and Table 2. We see a significant (more than x) improvement in convergence rate, which further translates into a substantial improvement in BLEU score.

We trained both models on a Cloud TPU-v2 [13]. A configuration has 32 cores, each with 8GB of memory. The Transformer model for WMT’14 en→de was trained with batches of size 1536 for 700k steps. The Transformer-Big model for WMT’14 en→fr was trained with the maximal batch size that could fit on each core, yielding an effective batch of size 768, for 1M steps. The Transformer-Big model consists of 6 layers for its encoder and decoder; each layer is composed of 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads. In total, Transformer-Big has 375.4M parameters (1.432GB) and uses a significant fraction of the overall memory, thus making SM3 more effective there.

All experiments were run with synchronous (stochastic) gradient updates. The models used 32K word-pieces [19] for each language pair. We computed BLEU scores on Newstest 2014 for evaluation. We also disabled checkpoint averaging in order to underscore the improved convergence rate of SM3. Our BLEU scores are not directly comparable to those of [23]; instead, we followed the experimental protocol described in [5]. BLEU scores were computed on tokenized, true-case outputs and without manual post-processing of the text, similar to [24].

Dataset   Model             Optimizer   BLEU
en→de     Transformer       Adafactor   26.88
                            SM3         27.32
en→fr     Transformer-Big   Adafactor   39.67
                            SM3         40.49
Table 2: BLEU scores on WMT’14 datasets.

Figure 1: Test loss (log-perplexity) of a Transformer model on the WMT’14 en→de dataset.
Figure 2: Test loss (log-perplexity) of Transformer-Big on the WMT’14 en→fr dataset. Adam is infeasible with this particular batch size due to memory constraints.

5.2 Language modeling

We trained a BERT-Large language model from [8] on the combined Wikipedia and BooksCorpus [25]. BERT-Large is a large bidirectional transformer model containing 24 transformer blocks with 1024 hidden dimensions and 16 self-attention heads. It has 340M parameters (1.297 GiB), and is set up to optimize two losses jointly: (a) a masked language model (Masked-LM) loss, where the task is to predict masked tokens based on surrounding context, and (b) a next sentence prediction (NSP) loss, where the task is to predict whether a sentence follows another sentence, with negative sentences randomly selected from the corpus.

We ran all our experiments using the open sourced code from [8] on a Cloud TPU-v2 configuration which has 128 cores. The baseline used was the Adam optimizer with learning rate , , and . The learning rate was warmed up over the first 10,000 steps, followed by a linear decay. SM3 used the same warmup as a safety mechanism, with no further tinkering. Momentum was set to 0.9. We trained all models for 500K steps. We split the dataset into a train-test split.

Our results are presented in Figure 3. We see that SM3 works as well as Adam for the same batch size. However, SM3 lets us train with a much larger batch size using a similar amount of memory as Adam. We were able to increase the number of examples in each batch by a factor of 2, yielding quality improvements and faster convergence.

Figure 3: Masked LM+NSP test loss (left) and Masked LM test accuracy (right) of BERT-Large on Wikipedia+BooksCorpus. SM3 with batch size 2048 uses about the same amount of memory as Adam with batch size 1024; using the same 1024 batch size, a step of SM3 is faster than Adam’s by 3.

5.3 AmoebaNet-D on ImageNet

We trained AmoebaNet-D, described in [16], which was originally constructed to have low training cost on the ImageNet dataset. We used the open-source code available from [6], where we changed the optimizer to SM3 and removed learning rate decay. The model was trained on a Cloud TPU-v2 configuration. The baseline used RMSProp [12] with Nesterov momentum and a staircase learning rate decay schedule. The model was trained with a batch size of 1024, as recommended in [6]. Our results in Figure 4 indicate that SM3 performed very well in this task and resulted in improved top-1 (77.95) and top-5 (93.89) accuracies.

Figure 4: Top-1 (left) and Top-5 (right) accuracy of AmoebaNet-D on ImageNet.

6 Conclusions

We presented SM3, a simple and effective adaptive optimization algorithm for stochastic optimization in settings where memory during training is severely limited. In these settings, the memory overhead of adaptive methods such as AdaGrad and Adam is prohibitively large, and thus limits the size of models that can be trained as well as the number of samples in each mini-batch. We demonstrated empirically that SM3 can be effectively used in such settings and dramatically decreases memory overhead. Utilizing the freed memory for increasing the batch size, our experiments show that this saving can also lead to significant gains in performance.

In future work we will focus on extending and strengthening our theoretical guarantees, improving the robustness of SM3, and further experimentation with various covers for additional domains. In particular, we plan to evaluate SM3 on training recurrent networks for speech recognition and audio generation.


We would like to thank Luke Metz, Kunal Talwar, Yonghui Wu for many discussions and helpful suggestions. Special thanks go to Samy Bengio who made it possible for us to conduct large scale experiments on a tight schedule. We would also like to thank Zhifeng Chen for coming up with the shorthand ‘SM3’.