1 Introduction
Second-order gradient methods are among the most powerful algorithms in mathematical optimization. Algorithms in this family use a preconditioner matrix to transform the gradient before applying each step. Classically, this involves computing or approximating the matrix of second-order derivatives, i.e., the Hessian, in the context of exact deterministic optimization (e.g., fletcher2013practical; lewis2013nonsmooth; nocedal1980updating). In contrast, AdaGrad (duchi2011adaptive) and related algorithms that target stochastic optimization use the covariance matrix of second-order gradient statistics to form the preconditioner.
While second-order methods often have significantly better convergence properties than first-order methods, the size of typical problems prohibits their use in practice, as they require storage quadratic and computation time cubic in the number of parameters for each gradient update. Thus, these methods are not commonly seen in the present practice of optimization in machine learning, which is largely dominated by the simpler-to-implement first-order methods. Arguably, one of the greatest challenges of modern optimization is to bridge this gap between theoretical and practical optimization and make second-order optimization more feasible to implement and deploy.
In this paper, we attempt to contribute towards narrowing this gap between theory and practice, focusing on second-order adaptive methods. These methods can be thought of as full-matrix analogues of common adaptive algorithms of the family of AdaGrad (duchi2011adaptive) and Adam (kingma2014adam). The methods maintain a matrix, akin to a covariance matrix, that accumulates the outer products of the stochastic gradients and is used by the algorithm to precondition the gradient at each step. These full-matrix versions are potentially more powerful than first-order methods as they can exploit statistical correlations between (gradients of) different parameters, but at the same time they suffer from the aforementioned prohibitive runtime and memory costs.
Recent developments in the space of second-order methods, on which we focus in this paper, include the KFAC (kfac) and Shampoo (shampooicml) algorithms, which exploit the structure of deep networks (and more generally, models described by a collection of tensors) to mitigate the space and runtime costs of full-matrix second-order algorithms. These methods approximate each preconditioning matrix using a factored representation that stems from the network structure. However, in very large applications, such algorithms are still impractical due to their serial nature, as well as due to a number of numerical and infrastructural pitfalls that face any attempt to implement a full-matrix optimization method.
1.1 Our contributions
We provide solutions to practical concerns and challenges that arise in implementing and using second-order methods at immense scale. Our concrete focus will be on the Shampoo algorithm, but most of the challenges we address are faced by any attempt to implement a second-order method. These include:

We replace expensive spectral decompositions (SVD) with an efficient iterative method for computing roots of PSD matrices.

To further mitigate the runtime cost of the above root computation, we design and implement an asynchronous version of the algorithm. This approach exploits the heterogeneity and computing power of CPU-accelerator coupled architectures.

We extend Shampoo in a number of ways so as to make it applicable to a larger range of deep architectures; in particular, this extension facilitates the usage of the algorithm for training very large embedding layers.

We describe practical challenges and limitations of the algorithm in its current form, which we argue could be useful in the design of next-generation hardware architectures with increased on-chip memory and higher-precision matrix multiplies.
Applying our novel distributed implementation of the modified algorithm, we demonstrate superior performance on very large training problems in machine translation. Our implementation achieves up to 1.67x speedup in training time compared to the best published optimizer. Moreover, our implementation runs considerably faster (in total wall-clock time) than any first-order method while achieving on-par or better accuracy.
To the best of our knowledge, our design and implementation is the first to demonstrate the power and scalability of second-order methods in practice, with its actual wall time beating all widely used first-order methods on problems with a very large number of parameters and examples.
1.2 Related work
Various approximations to the preconditioning matrix have been proposed in the recent literature (e.g., gonen2015faster; erdogdu2015convergence; agarwal2016second; xu2016sub; pilanci2017newton). However, so far the only prevalent and pragmatic approximation is the diagonal approximation used by widespread (often adaptive) optimizers.
Some recent approaches for approximating the full-matrix preconditioner are KFAC (kfac), Shampoo (shampooicml) and GGT (GGT). KFAC uses a factored approximation of the Fisher information matrix as a preconditioner. While our focus in this paper is on the Shampoo algorithm, we believe that many of the techniques presented here could also be applied to make KFAC practical at large scale. GGT uses a clever trick to compute a low-rank approximation to the AdaGrad preconditioner. However, GGT maintains several hundred copies of the gradient in memory, which is too expensive even for mid-sized models.
1.3 Paper organization
The rest of the paper is organized as follows. In Section 2 we provide some background on preconditioning methods and describe the Shampoo algorithm. We next discuss the various challenges one faces in a practical implementation of second-order methods in Section 3. In Section 4 we describe the design of the distributed version of Shampoo with accelerators for deep learning, and in Section 5 we describe the improvements we made to Shampoo to make it work with our system. Finally, in Section 6 we describe experiments on several datasets, showing that our implementation significantly outperforms common first-order methods such as SGD, Adam and AdaGrad.
1.4 Notation
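We will use lowercase letters to denote scalars and vectors, and uppercase letters to denote matrices. We use $\preceq$ to denote the Loewner order: given square symmetric matrices $A, B$, we write $A \preceq B$ iff $B - A$ is positive semidefinite (PSD). Given a symmetric PSD matrix $A$, and $\alpha \in \mathbb{R}$, $A^{\alpha}$ is defined as follows: let $A = U D U^{\mathsf{T}}$ be the singular value decomposition of $A$, where $U$ is a unitary matrix and $D$ is a diagonal matrix (with $D_{ii} \ge 0$ as $A$ is PSD), then $A^{\alpha} = U D^{\alpha} U^{\mathsf{T}}$, where $(D^{\alpha})_{ii} = D_{ii}^{\alpha}$. We use $A \odot B$ to denote the Hadamard or element-wise product of $A$ and $B$ which have the same shape, so $C = A \odot B \implies C_{ij} = A_{ij} B_{ij}$. $A \otimes B$ will denote the Kronecker product of any two matrices $A$ and $B$. We use $\mathrm{vec}(A)$ to denote the flattening of the $m \times n$ matrix $A$: if $A$ has rows $a_1, \ldots, a_m$, then $\mathrm{vec}(A)$ is the $mn \times 1$ column vector $\mathrm{vec}(A) = (a_1, \ldots, a_m)^{\mathsf{T}}$. $\|A\|_F$ denotes the Frobenius norm of $A$: $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$.
2 Background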
2.1 Adaptive preconditioning methods
First-order methods iteratively update the parameters solely based on gradient information: $w_{t+1} = w_t - \eta_t \bar{g}_t$, where $w_t$ and $\bar{g}_t$ are (column) vectors in $\mathbb{R}^d$. Here $\eta_t$ is the learning rate and $\bar{g}_t$ denotes a linear combination of the current and past gradients $g_1, \ldots, g_t$, where different algorithms use different combinations. In contrast, preconditioned methods take the following form: $w_{t+1} = w_t - P_t \bar{g}_t$, where $P_t$ is a $d \times d$ matrix. Whereas in Newton-type methods this matrix is related to the Hessian matrix of second-order derivatives, adaptive gradient methods form their preconditioning matrix based on gradient-gradient correlations.
The parameters of a deep network form a set where each element of the set is typically an order two (i.e., a matrix), three, or four tensor. For simplicity of presentation we focus on the matrix case; however, our design, analysis, and implementation hold for tensors of arbitrary order. We denote the space of parameters by the matrix $W \in \mathbb{R}^{m \times n}$ and an estimate of the gradient at $W$ by $G$.
A full-matrix preconditioning would flatten $W$ and represent it as a vector of dimension $mn$. It thus requires $m^2 n^2$ space and would take $m^3 n^3$ time to perform the update. Even if we focus merely on a single layer of a deep network, $m$ and $n$ would be in the 1000’s in state-of-the-art models, thus rendering full-matrix preconditioning impractical. For this reason, AdaGrad, and analogously Adam, constrain the preconditioning matrices to be diagonal. Shampoo bridges the gap between full-matrix preconditioning and the diagonal version by approximating the full preconditioning matrices with smaller factors.
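To make the gap concrete, the short sketch below counts how many preconditioner entries each approach stores for a single matrix-shaped parameter. The function name and the 1000 x 1000 layer size are illustrative assumptions; the "two_factors" row assumes one $m \times m$ and one $n \times n$ statistic, as in the Shampoo algorithm of Section 2.2.

```python
def preconditioner_entries(m: int, n: int) -> dict:
    """Number of stored preconditioner entries for one m x n parameter matrix."""
    d = m * n  # dimension of the flattened parameter vector
    return {
        "full_matrix": d * d,          # a single (mn) x (mn) matrix
        "two_factors": m * m + n * n,  # one m x m and one n x n matrix
        "diagonal": d,                 # one entry per parameter
    }


costs = preconditioner_entries(1000, 1000)
for name, entries in costs.items():
    print(f"{name:>12}: {entries:.1e} entries")
```

For this layer size the full matrix is six orders of magnitude larger than the diagonal, while the two factors cost only twice as much as the diagonal.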
2.2 The Shampoo algorithm
As noted above, we describe the Shampoo algorithm for matrix-shaped parameter spaces, for brevity. All of our modifications to Shampoo are extended and implemented for tensors of arbitrary dimension.
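We describe Shampoo in the context of the Online Convex Optimization framework, which is closely related to (in fact, generalizes and extends) stochastic optimization (see, e.g., shalev2012online; hazan2016introduction). In Online Convex Optimization, learning progresses in rounds where on round $t$ the learner receives an input $X_t$ and then uses the matrix $W_t$ to form a prediction denoted $\hat{y}_t$. After making the prediction, the true outcome $y_t$ is revealed. The discrepancy between the true and predicted outcomes is assessed through a loss function $\ell$ which takes values in $\mathbb{R}_+$. The learner then uses the discrepancy to update the matrix to $W_{t+1}$ and prepare for the next round. For instance, the input on round $t$ can be an example $x_t \in \mathbb{R}^n$ for which the learner predicts $\hat{y} = f(W_t, x_t)$ where $f: \mathbb{R}^m \to \mathbb{R}$ and the loss is a function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ such as $\ell(\hat{y}, y) = (y - \hat{y})^2$ or $\ell(\hat{y}, y) = \log(1 + \exp(-y \hat{y}))$.
Stochastic gradient methods use the gradient $G_t = \nabla_W \ell(f(W, x_t), y_t)$, thus naturally $G_t \in \mathbb{R}^{m \times n}$ as the parameters are shaped as a matrix $W \in \mathbb{R}^{m \times n}$. The Shampoo algorithm tracks two statistics over the course of its run, $L_t$ and $R_t$, which are defined as follows,
$$L_t = \epsilon I_m + \sum_{s=1}^{t} G_s G_s^{\mathsf{T}}, \qquad R_t = \epsilon I_n + \sum_{s=1}^{t} G_s^{\mathsf{T}} G_s.$$
Note that $L_t \in \mathbb{R}^{m \times m}$, while $R_t \in \mathbb{R}^{n \times n}$. These matrices are used to precondition the gradient and update $W$, as follows:
$$W_{t+1} = W_t - \eta \, L_t^{-1/4} G_t R_t^{-1/4}.$$
The primary complexity of Shampoo arises from computing $L_t^{-1/4}$ and $R_t^{-1/4}$, which the original algorithm obtains through a singular value decomposition, an expensive operation.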
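The update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in our system: the helper names (`psd_power`, `shampoo_step`), the toy quadratic objective, and the constants are all hypothetical, and the matrix powers are computed with a plain eigendecomposition.

```python
import numpy as np


def psd_power(a: np.ndarray, alpha: float) -> np.ndarray:
    """Return A^alpha for a symmetric positive definite A via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * eigvals**alpha) @ eigvecs.T


def shampoo_step(w, g, left, right, lr):
    """One Shampoo update for a matrix parameter; returns (w, left, right)."""
    left = left + g @ g.T    # L_t = L_{t-1} + G_t G_t^T   (m x m)
    right = right + g.T @ g  # R_t = R_{t-1} + G_t^T G_t   (n x n)
    update = psd_power(left, -0.25) @ g @ psd_power(right, -0.25)
    return w - lr * update, left, right


# Toy problem (an assumption for illustration): minimize 0.5 * ||W - W*||_F^2.
rng = np.random.default_rng(0)
m, n, epsilon = 4, 3, 1e-4
target = rng.standard_normal((m, n))
w = np.zeros((m, n))
left, right = epsilon * np.eye(m), epsilon * np.eye(n)

initial_loss = 0.5 * np.sum((w - target) ** 2)
for _ in range(50):
    grad = w - target  # gradient of the toy objective
    w, left, right = shampoo_step(w, grad, left, right, lr=0.1)
final_loss = 0.5 * np.sum((w - target) ** 2)
print(initial_loss, final_loss)
```

Both `psd_power` calls are the expensive part: each costs a full eigendecomposition per step, which is the cost the rest of the paper works to avoid.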
2.3 Modern neural network training
Neural networks today are typically trained with minibatch gradient descent. Modern accelerators such as GPUs and TPUs allow us to scale up neural network training by parallelizing the minibatch forward and backward propagation calculations across many devices, commonly referred to as data parallelism (dean2012). These devices have fast communication links between them to aggregate and broadcast the gradients. Moreover, the vast majority of the models trained today use the synchronous version of minibatch gradient descent, where all devices coordinate to make the update and see the same values of the parameters at every step. Parameters in the data-parallel case are replicated across all the individual device memories. An alternate strategy is to divide up the parameters across several devices, where the placement policy takes into account the computational graph so that multiple parts of the model can run concurrently; this is referred to as model parallelism.
3 Full-matrix Preconditioning: Challenges
Several challenges and design considerations arose in developing the distributed training system for Shampoo. These mainly stem from the fact that modern accelerators are highly optimized for training using first-order optimizers, which have low computational and memory requirements. The Shampoo algorithm is computationally expensive, and could become prohibitive for large models.
The extra overheads of Shampoo compared to standard first-order methods are in the following steps:

Preconditioner statistics computation: $L_t = L_{t-1} + G_t G_t^{\mathsf{T}}$ and $R_t = R_{t-1} + G_t^{\mathsf{T}} G_t$

Inverse $p$’th root computation: $L_t^{-1/4}$ and $R_t^{-1/4}$

Preconditioned gradient computation: $L_t^{-1/4} G_t R_t^{-1/4}$
As we will show later in Section 6, the computation of the second-order statistics and the preconditioned gradient does not add significantly to the runtime of each step. However, computing the inverse $p$’th roots is very slow, as much as 100 times the step time in some cases, and performing these computations without slowing down the training was the main challenge in our system.
3.1 Algorithmic challenges
Large layers.
Modern ML architectures often use very large embedding layers, where the longer dimension can be in the millions. The Shampoo algorithm required computing a preconditioner for each dimension, but neither computing nor storing a million times million matrix is feasible. We had to extend the algorithm to allow us to choose which dimensions to precondition. In addition, the very largest models occasionally have large fully connected layers. In Section 5 we show that partitioning a large tensor into smaller blocks and preconditioning each block is feasible, and does not impact accuracy significantly.
Delayed preconditioners.
As remarked above, computing the preconditioners is the most expensive computation in every Shampoo step. In Section 6 we show that we can compute the preconditioners once every few hundred steps without a significant effect on the accuracy, which indicates that the loss function landscape does not change significantly with each step.
3.2 Numerical challenges
Inverse $p$’th roots (where typically $p = 2, 4, 8$) can be computed using SVD, but there are efficient iterative algorithms such as the Schur-Newton algorithm (guo2006schur) that can compute the inverse $p$’th root as a sequence of matrix-vector and matrix-matrix products, which are highly optimized on modern accelerators. However, our experiments suggest that on real workloads the condition numbers of the statistics matrices are very large (see Fig. 2), so both SVD and Schur-Newton must be run in double precision, which is very expensive on accelerators.
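As an illustration of how an inverse root can be obtained from matrix products alone, the sketch below uses a coupled Newton iteration in float64. It is one iterative scheme of this general kind, shown under stated assumptions; it is not claimed to be the exact routine used in our system, and the function name, scaling constant, iteration cap and tolerance are illustrative choices.

```python
import numpy as np


def inverse_pth_root(a, p, num_iters=100, tol=1e-12):
    """Approximate A^(-1/p) for symmetric positive definite A with matrix products only."""
    a = np.asarray(a, dtype=np.float64)  # double precision, as the text requires
    identity = np.eye(a.shape[0])
    # Scale so that the eigenvalues of m start in (0, (p + 1) / 2].
    z = (1 + p) / (2 * np.linalg.norm(a, 2))
    m = z * a                        # invariant: m = x^p @ a
    x = z ** (1.0 / p) * identity
    for _ in range(num_iters):
        if np.max(np.abs(m - identity)) < tol:
            break
        t = ((p + 1) * identity - m) / p
        x = x @ t
        m = np.linalg.matrix_power(t, p) @ m
    return x  # once m is close to the identity, x is close to A^(-1/p)


rng = np.random.default_rng(0)
g = rng.standard_normal((6, 8))
a = g @ g.T + 1e-3 * np.eye(6)
x = inverse_pth_root(a, p=4)
residual = np.linalg.norm(np.linalg.matrix_power(x, 4) @ a - np.eye(6))
print(residual)
```

The loop keeps the invariant $M = X^p A$, so driving $M$ to the identity drives $X$ to $A^{-1/p}$. Poorly conditioned inputs need more iterations before the fast final convergence sets in, which is one reason the precision of the arithmetic matters.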
3.3 Infrastructural challenges
Heterogeneous training hardware.
Neural network accelerators are custom designed to run machine learning workloads faster and at lower cost. Accelerator design is trending towards lower-precision (8-bit/16-bit) arithmetic, which satisfies both of these goals on existing benchmarks. Our method demands double-precision arithmetic as described above, which makes running the computation on accelerators a non-starter. We therefore designed the system to leverage the existing underutilized CPUs of the training system, as described in Section 4.1.
API inflexibility.
Deep learning libraries such as TensorFlow (tensorflow) offer APIs for optimizer implementation that are well suited for first-order optimizers and for minibatch training. Our design requires that we interact with the training loop in non-standard ways, which requires framework-level changes. Our experiments were carried out using Lingvo (lingvo) and required changes to the training loop such as distributing computation to CPUs. We expect that this demonstration of the utility of full-matrix preconditioning will encourage the development of more flexible APIs to fully utilize heterogeneous hardware.
Memory available on training hardware.
Neural network accelerators typically have 8 to 32 GiB of on-board memory today. However, recent progress in natural language processing has demonstrated the value of inflating the model size from hundreds of millions to billions of parameters. For these large models the optimizer overhead of having auxiliary variables for gradient statistics and preconditioners can restrict training by forcing smaller minibatch sizes, or prohibiting some optimizers altogether.
4 Distributed System Design
Our method is designed to run effectively on modern neural network accelerators such as TPUs (jouppi2017datacenter) or GPUs. We first describe the standard paradigm of data parallelism used in training models on these accelerators. Each core of the accelerator computes forward propagation and back propagation on a sub-batch (a subset of a minibatch, which itself is a small randomly selected subset of the training set) of input examples. The minibatch gradient is then computed by averaging the gradients from all cores via all-reduction, and the aggregated gradient is used for weight updates. The forward propagation and back propagation are run in parallel across all cores available on the system.
All-reduction adds a barrier: all the cores synchronize to aggregate the minibatch gradients from sub-batches and apply the weight update. In Fig. 3 we measure the overheads of each of the steps on a Transformer model (vaswani2017attention) described in the experiment section. We observe that the overheads from all-reduction and weight updates are a minor part of the overall step time.
4.1 Exploiting the heterogeneity of the distributed training hardware
The overall design of our implementation is illustrated by the timeline in Fig. 4. As discussed in the previous section, the preconditioner computation (inverse $p$’th root) is expensive and requires double precision. Here we exploit the heterogeneity in the distributed training hardware by utilizing a key resource that is typically available in distributed training architectures: the central processing units (CPUs) on the machines to which the accelerators such as GPUs or Cloud TPUs are attached. These CPUs are responsible for gathering and processing training data, and for auxiliary activities such as checkpointing and summarization of training state. They are often idle or at low utilization while the accelerator is running the training loop, and they offer double-precision arithmetic automatically, which makes them a perfect choice to run the preconditioner computation without adding any extra cost to the training run.
As mentioned above, we run the preconditioner computation every few hundred steps and make use of a stale version of the preconditioner in the training loop until a fresher version becomes available (see Fig. 9 for empirical justification). Moreover, the computation is pipelined and runs asynchronously without blocking the training loop. As preconditioners need to be computed for every layer of the network, we distribute the computation across all the CPUs that are part of the training system. As a result, the most expensive step in Shampoo adds almost nothing to the overall training time!
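The stale-preconditioner pattern can be sketched on a single machine as follows. This is a toy illustration under stated assumptions: a one-worker thread pool stands in for the host CPUs, random matrices stand in for gradients, and all names (`run_loop`, `refresh_every`) are hypothetical. It only shows the control flow, in which the loop never waits for the root computation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def inverse_root(stat: np.ndarray, p: int = 4) -> np.ndarray:
    """Inverse p-th root of a symmetric positive definite matrix, in float64."""
    eigvals, eigvecs = np.linalg.eigh(stat.astype(np.float64))
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def run_loop(num_steps=50, refresh_every=10, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    stat = 1e-4 * np.eye(dim)  # running statistic L_t
    precond = np.eye(dim)      # possibly stale inverse root used by the loop
    pending, refreshes = None, 0
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        for step in range(1, num_steps + 1):
            grad = rng.standard_normal((dim, dim))  # stand-in for a real gradient
            stat = stat + grad @ grad.T  # new array, so a submitted snapshot is never mutated
            # Pick up a finished result without ever blocking the step.
            if pending is not None and pending.done():
                precond, pending = pending.result(), None
                refreshes += 1
            # Periodically hand the current statistic to the background worker.
            if step % refresh_every == 0 and pending is None:
                pending = cpu_pool.submit(inverse_root, stat)
            _ = precond @ grad  # the step uses whatever preconditioner is available
        if pending is not None:  # drain the last job before returning
            precond = pending.result()
            refreshes += 1
    return precond, refreshes


precond, refreshes = run_loop()
print(refreshes)
```

The number of refreshes observed depends on thread timing, which mirrors the real system: the training loop simply uses the freshest preconditioner that happens to be available.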
5 Algorithmic Modifications
We now describe two simple enhancements to Shampoo that are critical to make it practical for large models.
5.1 Decoupling the step size and the direction
We empirically observed that Shampoo updates give directions that are superior to those of diagonal AdaGrad; however, the per-tensor (“layer-parameters”) scale of the learning rates caused numerical and training instabilities. Our solution is to run diagonal AdaGrad, which is inexpensive to compute, in parallel. We derive the learning rate for the update of each tensor by ensuring it is on par with the diagonal counterpart. Concretely, the weight matrix $W_t$ is updated as follows,
$$D_t = \sum_{s=1}^{t} G_s \odot G_s, \qquad \eta_t = \eta_0 \frac{\left\| D_t^{\odot -1/2} \odot G_t \right\|_F}{\left\| L_t^{-1/4} G_t R_t^{-1/4} \right\|_F}, \qquad W_{t+1} = W_t - \eta_t \, L_t^{-1/4} G_t R_t^{-1/4},$$
where $D^{\odot \alpha}$ is the element-wise power, $(D^{\odot \alpha})_{ij} = D_{ij}^{\alpha}$. In words, we first compute diagonal statistics. We then set the learning rate to be the ratio of the norms of the preconditioned gradients according to diagonal AdaGrad and Shampoo. This ensures that the learning rate would be on par with that of diagonal AdaGrad. Last, we update the parameters using Shampoo with the (automatically) rescaled learning rate.
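The rescaling step can be sketched as below. This is a minimal illustration assuming the statistics already include the current gradient; the function names and the small constant added to the diagonal statistic (to avoid division by zero) are assumptions made for the example.

```python
import numpy as np


def inverse_root(a, p):
    """Inverse p-th root of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def rescaled_shampoo_update(w, g, left, right, diag, base_lr):
    """Shampoo direction whose step norm matches that of diagonal AdaGrad.

    `left`, `right` and `diag` are the statistics L_t, R_t and D_t, assumed to
    already include the current gradient g.
    """
    adagrad_dir = g / np.sqrt(diag)
    shampoo_dir = inverse_root(left, 4) @ g @ inverse_root(right, 4)
    # Frobenius norms: np.linalg.norm defaults to Frobenius for 2-D arrays.
    lr = base_lr * np.linalg.norm(adagrad_dir) / np.linalg.norm(shampoo_dir)
    return w - lr * shampoo_dir, lr


rng = np.random.default_rng(0)
m, n, eps = 5, 3, 1e-4
grads = [rng.standard_normal((m, n)) for _ in range(10)]
left = eps * np.eye(m) + sum(g @ g.T for g in grads)
right = eps * np.eye(n) + sum(g.T @ g for g in grads)
diag = eps + sum(g * g for g in grads)  # eps here is an added safeguard

w = np.zeros((m, n))
g = grads[-1]
new_w, lr = rescaled_shampoo_update(w, g, left, right, diag, base_lr=0.1)
step_norm = np.linalg.norm(new_w - w)
adagrad_norm = 0.1 * np.linalg.norm(g / np.sqrt(diag))
print(step_norm, adagrad_norm)
```

By construction the two printed norms agree: the direction comes from Shampoo, while the step length is the one diagonal AdaGrad would have taken.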
5.2 Preconditioning large tensors
Each preconditioning step of Shampoo requires $O(n^3)$ time where $n$ is the largest dimension of the tensor. While this is better than full-matrix AdaGrad, for large layers such as the embedding and softmax layers in language models, it is intractable to compute Shampoo preconditioners. These layers typically are of the shape vocabulary-size $\times$ embedding-dimension. For instance, for our machine translation experiments we use a vocabulary size of 32000 and 512 embedding dimensions. There is a large storage and computational cost associated with the preconditioner along the vocabulary axis. For example, computing the inverse $p$’th root of a $32000 \times 32000$ matrix would take hours.
In order to retain the benefits of preconditioning for these layers, we bypass preconditioning of excessively large dimensions. The following result allows us to use any subset of preconditioners as long as their exponents sum up to $-1/2$.
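Lemma 1.
Assume that $G_1, \ldots, G_t \in \mathbb{R}^{m \times n}$ are matrices of rank at most $r$. Let $g_s = \mathrm{vec}(G_s)$ and define
$$\hat{H}_t = \epsilon I_{mn} + \sum_{s=1}^{t} g_s g_s^{\mathsf{T}}.$$
Let $L_t, R_t$ be defined as above,
$$L_t = \epsilon I_m + \sum_{s=1}^{t} G_s G_s^{\mathsf{T}}, \qquad R_t = \epsilon I_n + \sum_{s=1}^{t} G_s^{\mathsf{T}} G_s.$$
Then, the following properties hold:

(1) $\hat{H}_t \preceq r \, L_t \otimes I_n$ and $\hat{H}_t \preceq r \, I_m \otimes R_t$;

(2) for any $p, q > 0$ such that $1/p + 1/q = 1$, we have $\hat{H}_t \preceq r \, L_t^{1/p} \otimes R_t^{1/q}$.
Proof.
Both inequalities of (1) were proven as part of Lemma 8 of shampooicml. By using Ando’s inequality (ando2004geometric), we get
$$\hat{H}_t \preceq r \, (L_t \otimes I_n)^{1/p} (I_m \otimes R_t)^{1/q} = r \, (L_t^{1/p} \otimes I_n)(I_m \otimes R_t^{1/q}) = r \, L_t^{1/p} \otimes R_t^{1/q},$$
which concludes the proof.
An immediate consequence is that for any $p, q > 0$ such that $1/p + 1/q = 1$, we can employ preconditioning of the form
$$\tilde{G}_t = L_t^{-1/(2p)} \, G_t \, R_t^{-1/(2q)}.$$
Further, by choosing $(p, q) = (1, \infty)$ and $(p, q) = (\infty, 1)$ we obtain the simple preconditioned gradients,
$$L_t^{-1/2} G_t \qquad \text{and} \qquad G_t R_t^{-1/2}.$$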
Our choice is empirically supported by the experiments shown in Fig. 7 which suggest that there is a benefit from preconditioning the large softmax and embedding layers with minimal increase in time as shown in Fig. 5.
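A sketch of bypassing the preconditioner along an excessively large dimension is given below. It is an illustration under stated assumptions rather than our implementation: the threshold `max_dim` and the function names are hypothetical, and for brevity the statistics are formed from a single gradient instead of being accumulated over steps.

```python
import numpy as np


def inverse_root(a, p):
    """Inverse p-th root of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def precondition_skipping_large_dims(g, max_dim, epsilon=1e-4):
    """Precondition g only along dimensions of size at most max_dim."""
    m, n = g.shape
    use_left, use_right = m <= max_dim, n <= max_dim
    # With both sides active each gets exponent -1/4; a single side gets -1/2,
    # so the exponents in use always sum to -1/2.
    p = 4 if (use_left and use_right) else 2
    out = g
    if use_left:
        out = inverse_root(epsilon * np.eye(m) + g @ g.T, p) @ out
    if use_right:
        out = out @ inverse_root(epsilon * np.eye(n) + g.T @ g, p)
    return out  # returned unchanged if both dimensions exceed max_dim


rng = np.random.default_rng(0)
g = rng.standard_normal((2000, 16))  # small stand-in for a vocabulary x embedding layer
out = precondition_skipping_large_dims(g, max_dim=1024)
gram = out.T @ out
print(out.shape)
```

In this example the 2000 x 2000 statistic along the long axis is never formed; only the 16 x 16 statistic is computed and inverted.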
5.3 Preconditioning blocks from large tensors
To reduce the computational cost of computing statistics and preconditioned gradients, a natural extension is to divide the tensor into blocks and treat each block as a separate tensor. Concretely, this entails dividing the tensor $W$ into blocks $W_{i,j}$, each of which gets its own preconditioners. We ran experiments that partition intermediate layers into blocks, and we observe empirically that this has minimal impact on the quality of the solution while providing faster step time (Fig. 8).
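The blocking idea can be sketched as follows, assuming the tensor is split only along its second axis into equally sized blocks. The function names and sizes are illustrative assumptions, and statistics are again formed from a single gradient for brevity.

```python
import numpy as np


def inverse_root(a, p):
    """Inverse p-th root of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def shampoo_direction(g, epsilon=1e-4):
    """Two-sided Shampoo direction for one (block of a) gradient matrix."""
    m, n = g.shape
    left = epsilon * np.eye(m) + g @ g.T
    right = epsilon * np.eye(n) + g.T @ g
    return inverse_root(left, 4) @ g @ inverse_root(right, 4)


def blocked_shampoo_direction(g, num_blocks):
    """Split g into column blocks and precondition each block independently."""
    blocks = np.split(g, num_blocks, axis=1)  # requires an even split
    return np.concatenate([shampoo_direction(b) for b in blocks], axis=1)


rng = np.random.default_rng(0)
g = rng.standard_normal((32, 128))
blocked = blocked_shampoo_direction(g, num_blocks=4)  # four 32 x 32 blocks
first_block = shampoo_direction(g[:, :32])
print(blocked.shape)
```

With four blocks the largest right statistic shrinks from 128 x 128 to 32 x 32, at the price of ignoring correlations between columns that fall in different blocks.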
6 Experiments
We compare our method against various well-known optimization algorithms for training large state-of-the-art deep models on several domains. Code and details on hyperparameter tuning are provided in the supplementary material.
6.1 Machine Translation with a Transformer
We demonstrate the effectiveness of our implementation on the standard machine translation dataset from WMT’14 English to French (en→fr) with 36.3M sentence pairs. We used the state-of-the-art Transformer architecture (vaswani2017attention). This architecture contains 93.3M parameters and consists of 6 layers for its encoder and decoder. Each layer is composed of 512 model dimensions, 2048 hidden dimensions, and 8 attention heads. The model makes use of a sub-word vocabulary that contains 32K word pieces (schuster12). The experiment was run on 32 cores of a Cloud TPU v3 Pod, and the optimizer was implemented in the Lingvo (lingvo) sequence-to-sequence modeling framework based on TensorFlow. Our results are shown in Fig. 6: our algorithm achieves the same accuracy as AdaGrad or Adam in about half as many steps.
Preconditioning of embedding and softmax layers:
Following the first methodology discussed in Section 5.2, the algorithm preconditions the large layers with only one of the preconditioners ($L_t$ or $R_t$) to make it tractable. Fig. 5 shows that the increase in step time is only 6%, while Fig. 7 shows that we can reduce the number of steps to convergence by 20%.
Reducing overhead in fullyconnected layers:
Following the block-partitioning methodology discussed in Section 5.3, we ran two experiments where we partitioned a fully connected layer of size [512, 2048] into two blocks of size [512, 1024] and into four blocks of size [512, 512]. Our experiments show no drop in quality under this approximation, with a small reduction in runtime.
Effect of delayed computation of preconditioners:
The frequency of preconditioner updates is a tunable parameter in our design that trades off step-time performance against solution quality. Experimentation on this tunable parameter revealed that our method can tolerate delays of up to 1200 steps without any noticeable quality loss; see Fig. 9.
[Figure 10: preconditioner matrices. Panels: (a) and (b) attention layer; (c) embedding layer.]
On the sign changes in the preconditioned gradient:
We visualize the preconditioners of the attention and embedding layers in Fig. 10, where we see rich structure that is not diagonal (i.e., not axis-parallel). The resulting preconditioned gradient is rotated and scaled (instead of just the axis-parallel scaling of first-order adaptive methods). We also found that on average 30% of all the coordinates change sign when comparing the preconditioned gradient with the gradient.
Learning rates schedules:
For the Transformer experiments, we fixed the warmup schedule as well as the decay schedules for Adam. For the smaller Transformer experiments, we tuned the hyperparameters for each of the algorithms over 100 trials. We took the best settings found for the momentum and second-moment parameters, and tuned the learning rates until either the model became unstable or performance stopped improving. As Shampoo uses layer-wise learning rate scales from AdaGrad, we found that for the exact same hyperparameter settings, Shampoo provides a modest improvement in performance. Moreover, Shampoo allows for larger learning rates than AdaGrad does, as shown in Fig. 11.
6.2 Transformer-Big model
We also ran experiments with a larger Transformer model. This model contains 375.4M parameters and consists of 6 layers for its encoder and decoder. Each layer is composed of 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads. Results are presented in Fig. 12, where again we see an improvement in the end-to-end wall-clock time. For the softmax, embedding and the projection fully-connected layer (with 8192 hidden dimensions) we only make use of the left preconditioner. We note that step time is dominated by the preconditioned gradient computation, which can be reduced by sub-blocking the layers. However, we ran into a compiler limitation due to the increased number of nodes; we will address this in future work.
On the overhead of the optimizer:
We capture the computational and memory complexity of the various schemes for handling large layers described in Section 5.2 in Table 1. We note that the overhead from computing the statistics, as well as from computing the preconditioned update for a single step of training, can be further reduced by increasing the batch size (indeed, these overheads are independent of the batch size), as shown in Fig. 13, where the overhead drops dramatically from 40% to 19%.
Type  Computation  Memory 

All preconditioner :  
Left only preconditioner for :  
Preconditioner: block size 
6.3 Image Classification
Finally, we trained a ResNet-50 model (resnet) on the ImageNet-2012 (russakovsky2015imagenet) dataset and compared it against the state-of-the-art baseline using SGD+Momentum. Models were trained at a batch size of 4096 for 90 epochs with L2 regularization and label smoothing. The learning rate was warmed up over the first 5 epochs, followed by a decay schedule where the learning rate is reduced by a factor of 10 at {30, 60, 90} epochs.
Our results are presented in Table 2 and Fig. 14. We find that Shampoo does not provide any improvement on test loss or accuracies. However, we see that our method is able to reduce the training loss faster than a well-tuned SGD+Momentum baseline. The worse generalization of adaptive algorithms has also been discussed in GGT (GGT) and requires additional regularization. We leave this as future work, as part of our goal to analyze the interplay between architectural choices and preconditioning.
Optimizer  Top-1 Accuracy  Top-5 Accuracy 

SGD+Momentum  76.43  93.23 
Shampoo  75.25  92.27 
Adagrad  73.72  91.55 
7 Conclusion
We presented a practical implementation of the Shampoo second-order algorithm. On a state-of-the-art Transformer model, our method reduces the overall wall-clock time by up to 40% compared to the fastest first-order methods. Our future work is in understanding the interplay between architecture choices, regularization, and preconditioning. We suspect that a joint search of these hyperparameters could reveal insights that could allow us to build more efficient networks.
References
Appendix A Implementation Details of Shampoo
Our implementation of the Shampoo algorithm for fully-connected layers is described in Algorithm I. The algorithm can use heavy-ball momentum for its updates, as well as an exponential moving average over the preconditioners, like Adam. The configuration parameter denotes the number of steps between subsequent fetches of the latest available preconditioner by the accelerator. The parameter must be set sufficiently high so that there is enough time for the CPU to complete the computation of the preconditioner asynchronously and pipeline it efficiently, but otherwise its setting does not have a significant effect on convergence.
Appendix B Further Details on Experiments
Experiment  Optimizer  Batch  Optimizer Parameters  Warmup 
Transformer  Adam  1536  , ,  40k steps 
Adagrad  1536  ,  40k steps  
Shampoo  1536  , , ,  40k steps  
TransformerBig  Adam  384  , ,  40k steps 
Adagrad  384  ,  40k steps  
Shampoo  384  , , ,  40k steps  
TransformerBig  Adagrad  1536  ,  40k steps 
Shampoo  1536  , , ,  40k steps  
ResNet50  SGD  4096  (staircase) ,  5 epochs 
AdaGrad  4096  (staircase) ,  5 epochs  
Shampoo  4096  (staircase) ,  5 epochs  
B.1 Transformer model on WMT’14 en→fr
For all optimizers, we make use of a warmup schedule where the learning rate is increased from 0.0 to its peak value over 40k steps. For the smaller Transformer experiments, we use a quadratic warmup, and for the larger Transformer experiments we use a linear warmup. We found that quadratic warmup improves all optimizers equally and provides a better log-perplexity. For the Adam optimizer experiments, we use the learning rate decay schedule suggested by vaswani2017attention.
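The two warmup shapes can be sketched as below. The exact functional forms are assumptions made for illustration (a linear and a squared ramp of the fraction of warmup completed); the function name and the peak value of 1.0 in the example are likewise hypothetical.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int = 40_000,
              kind: str = "linear") -> float:
    """Learning rate during warmup; stays at peak_lr once warmup is over."""
    frac = min(step / warmup_steps, 1.0)  # fraction of warmup completed
    if kind == "linear":
        return peak_lr * frac
    if kind == "quadratic":
        return peak_lr * frac**2  # assumed form: ramps more slowly at the start
    raise ValueError(f"unknown warmup kind: {kind}")


halfway_linear = warmup_lr(20_000, peak_lr=1.0, kind="linear")
halfway_quadratic = warmup_lr(20_000, peak_lr=1.0, kind="quadratic")
after_warmup = warmup_lr(50_000, peak_lr=1.0, kind="quadratic")
print(halfway_linear, halfway_quadratic, after_warmup)
```

Under these assumed forms, the quadratic ramp is at a quarter of the peak halfway through warmup, whereas the linear ramp is already at half of it.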
B.2 ResNet-50 on ImageNet
For SGD with Momentum, the learning rate is warmed up over the first 5 epochs from 0 to 1.6, followed by 10x drops of the learning rate at 30, 60 and 80 epochs. For AdaGrad and Shampoo, we change the peak learning rate to be 0.3075, 0.64, and weight decay of but follow the same staircase decay scheme as SGD with Momentum. For all optimizers, we grid search the L2 regularization parameter in the range and the peak learning rate between 0.016 and 16.