Second Order Optimization Made Practical

by Rohan Anil, et al.
Princeton University

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods that involve second-order derivatives and/or second-order statistics of the data have become far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a proof-of-concept distributed system implementation of a second-order preconditioned method (specifically, a variant of full-matrix Adagrad), that along with a few yet critical algorithmic and numerical improvements, provides significant practical gains in convergence on state-of-the-art deep models and gives rise to actual wall-time improvements in practice compared to conventional first-order methods. Our design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models which consists of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance on very large learning problems in machine translation where our distributed implementation runs considerably faster than existing gradient-based methods.


1 Introduction

Second-order gradient methods are among the most powerful algorithms in mathematical optimization. Algorithms in this family use a preconditioner matrix to transform the gradient before applying each step. Classically, this involves computing or approximating the matrix of second-order derivatives, i.e., the Hessian, in the context of exact deterministic optimization (e.g., fletcher2013practical; lewis2013nonsmooth; nocedal1980updating). In contrast, AdaGrad (duchi2011adaptive) and related algorithms that target stochastic optimization use the covariance matrix of second-order gradient statistics to form the preconditioner.

While second-order methods often have significantly better convergence properties than first-order methods, the size of typical problems prohibits their use in practice, as they require quadratic storage and cubic computation time for each gradient update. Thus, these methods are not commonly seen in the present practice of optimization in machine learning, which is largely dominated by the simpler-to-implement first-order methods. Arguably, one of the greatest challenges of modern optimization is to bridge this gap between theoretical and practical optimization and make second-order optimization more feasible to implement and deploy.

In this paper, we attempt to contribute towards narrowing this gap between theory and practice, focusing on second-order adaptive methods. These methods can be thought of as full-matrix analogues of common adaptive algorithms of the family of AdaGrad (duchi2011adaptive) and Adam (kingma2014adam). The methods maintain a matrix, akin to a covariance matrix that accumulates the outer products of the stochastic gradients, which is used by the algorithm to precondition the gradient at each step. These full-matrix versions are potentially more powerful than first-order methods as they can exploit statistical correlations between (gradients of) different parameters, but at the same time they suffer from the aforementioned prohibitive runtime and memory costs.

Recent developments in the space of second-order methods, on which we focus in this paper, include the K-FAC (kfac) and Shampoo (shampoo-icml) algorithms that exploit the structure of deep networks (and more generally, models described by a collection of tensors) for mitigating the space and runtime costs of full-matrix second-order algorithms. These methods approximate each preconditioning matrix using a factored representation that stems from the network structure. However, in very large applications, such algorithms are still impractical due to their serial nature, as well as due to a number of numerical and infrastructural pitfalls that face any attempt to implement a full-matrix optimization method.

1.1 Our contributions

We provide solutions to practical concerns and challenges that arise in implementing and using second-order methods at immense scale. Our concrete focus will be on the Shampoo algorithm, but most of the challenges we address are faced by any attempt to implement a second-order method. Our contributions include:

  • We replace expensive spectral decompositions (SVD) with an efficient iterative method for computing roots of PSD matrices.

  • To further mitigate the runtime cost of the above root computation, we design and implement an asynchronous version of the algorithm. This approach exploits the heterogeneity and computing power of CPU-Accelerator coupled architectures.

  • We extend Shampoo in a number of ways so as to make it applicable to a larger range of deep architectures; in particular, this extension facilitates the usage of the algorithm for training very large embedding layers.

  • We describe practical challenges and limitations of the algorithm in its current form, which we argue could be useful in the design of next generation hardware architecture with increased on-chip memory and higher precision matrix multiplies.

Applying our novel distributed implementation of the modified algorithm, we demonstrate superior performance on very large training problems in machine translation. Our implementation achieves up to 1.67x speedup in training time compared to the best published optimizer. Moreover, our implementation runs considerably faster (in total wall-clock time) than any first order method while achieving on par or better accuracy.

To the best of our knowledge, our design and implementation is the first to demonstrate the power and scalability of second-order methods in practice, with its actual wall-time beating all widely-used first-order methods on problems with a very large number of parameters and examples.

1.2 Related work

Various approximations to the preconditioning matrix have been proposed in the recent literature, (e.g., gonen2015faster; erdogdu2015convergence; agarwal2016second; xu2016sub; pilanci2017newton). However, so far the only prevalent and pragmatic approximation is the diagonal approximation used by widespread (often adaptive) optimizers.

Some recent approaches for approximating the full-matrix preconditioner are K-FAC (kfac), Shampoo (shampoo-icml) and GGT (GGT). K-FAC uses a factored approximation of the Fisher-information matrix as a preconditioner. While our focus in this paper is on the Shampoo algorithm, we believe that many of the techniques presented here could also be applied to make K-FAC practical at large scale. GGT uses a clever trick to compute a low-rank approximation to the AdaGrad preconditioner. However, GGT maintains several hundred copies of the gradient in memory, which is too expensive even for mid-sized models.

1.3 Paper organization

The rest of the paper is organized as follows. In Section 2 we provide some background on preconditioning methods and describe the Shampoo algorithm. We next discuss the various challenges one faces in a practical implementation of a second-order method in Section 3. In Section 4 we describe the design of the distributed version of Shampoo with accelerators for deep learning, and then describe the improvements we made to Shampoo to make it work with our system. Finally, in Section 6 we describe experiments on several datasets, showing that our implementation significantly outperforms common first-order methods such as SGD, Adam and AdaGrad.

1.4 Notation

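We will use lowercase letters to denote scalars and vectors, and uppercase letters to denote matrices. We use $\preceq$ to denote the Loewner order: given square symmetric matrices $A, B$, we write $A \preceq B$ iff $B - A$ is positive semidefinite (PSD). Given a symmetric PSD matrix $A$, and $\alpha \in \mathbb{R}$, $A^\alpha$ is defined as follows: let $A = U D U^\top$ be the singular value decomposition of $A$, where $U$ is a unitary matrix and $D$ is a diagonal matrix (with $D_{ii} \ge 0$ as $A$ is PSD), then $A^\alpha = U D^\alpha U^\top$, where $(D^\alpha)_{ii} = D_{ii}^\alpha$. We use $A \bullet B$ to denote the Hadamard or element-wise product of $A$ and $B$ which have the same shape, so $C = A \bullet B \implies C_{ij} = A_{ij} B_{ij}$. $A \otimes B$ will denote the Kronecker product of any two matrices $A$ and $B$. We use $\mathrm{vec}(A)$ to denote the flattening of the $m \times n$ matrix $A$: if $A$ has rows $a_1, \ldots, a_m$, then $\mathrm{vec}(A)$ is the $mn \times 1$ column vector $\mathrm{vec}(A) = (a_1, \ldots, a_m)^\top$. $\|A\|_F$ denotes the Frobenius norm of $A$: $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$.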
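The notation maps onto a handful of array operations. The NumPy sketch below is illustrative only: the helper name `psd_power` is ours, the input is assumed to be symmetric positive definite (so that negative powers are well defined), and `vec` is taken to be row-major flattening, i.e. rows stacked one after another. Under that convention the identity vec(A X B) = (A ⊗ Bᵀ) vec(X) holds, which is the link between two-sided and Kronecker-structured preconditioning.

```python
import numpy as np

def psd_power(a, alpha):
    """Return a**alpha for a symmetric positive definite matrix via A = U D U^T."""
    d, u = np.linalg.eigh(a)       # eigenvalues d, orthonormal eigenvectors u
    return (u * d**alpha) @ u.T    # U diag(d^alpha) U^T

rng = np.random.default_rng(0)
b = rng.standard_normal((3, 5))
a = b @ b.T + 0.1 * np.eye(3)      # symmetric positive definite test matrix

half = psd_power(a, 0.5)           # A^(1/2)
inv_quarter = psd_power(a, -0.25)  # A^(-1/4)

# Row-major vec and the Kronecker identity vec(A X C) = (A kron C^T) vec(X).
x = rng.standard_normal((3, 4))
c = rng.standard_normal((4, 4))
lhs = (a @ x @ c).reshape(-1)
rhs = np.kron(a, c.T) @ x.reshape(-1)
```

Squaring `half` recovers `a`, and the fourth power of `inv_quarter` is the inverse of `a`, up to floating-point error.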

2 Background

2.1 Adaptive preconditioning methods

First order methods iteratively update the parameters solely based on gradient information: $w_{t+1} = w_t - \eta_t \bar{g}_t$ where $w_t$ and $\bar{g}_t$ are (column) vectors in $\mathbb{R}^d$. Here $\bar{g}_t$ denotes a linear combination of the current and past gradients $g_1, \ldots, g_t$, where different algorithms use different combinations. In contrast, preconditioned methods take the following form: $w_{t+1} = w_t - P_t \bar{g}_t$ where $P_t$ is a $d \times d$ matrix. Whereas in Newton-type methods this matrix is related to the Hessian matrix of second-order derivatives, adaptive gradient methods form their preconditioning matrix based on gradient-gradient correlations.

The parameters of a deep network form a set where each element of the set is typically an order two (i.e., a matrix), three, or four tensor. For simplicity of the presentation we focus on the matrix case; however, our design, analysis, and implementation hold for tensors of arbitrary order. We denote the space of parameters by the matrix $W \in \mathbb{R}^{m \times n}$ and an estimate of the gradient at $W$ by $G$.

A full matrix preconditioning would flatten $W$ and represent it as a vector of dimension $mn$. It thus requires $m^2 n^2$ space and would take $m^3 n^3$ time to perform the update. Even if we focus merely on a single layer of a deep network, $m$ and $n$ would be in the thousands in state-of-the-art models, thus rendering full-matrix preconditioning impractical. For this reason, AdaGrad and analogously Adam constrain the preconditioning matrices to be diagonal. Shampoo bridges the gap between full matrix preconditioning and the diagonal version by approximating the matrices.
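To make the gap concrete, the following back-of-the-envelope sketch counts stored preconditioner entries for a single m × n layer under the three schemes. The function name and the float32 storage assumption are ours; the [512, 2048] shape is the fully connected layer size used in the experiments later in the paper, and the Shampoo count assumes one m × m and one n × n statistic.

```python
def preconditioner_entries(m, n):
    """Number of stored preconditioner entries for an m x n parameter matrix."""
    return {
        "full_matrix": (m * n) ** 2,  # one (mn) x (mn) matrix
        "shampoo": m * m + n * n,     # one m x m and one n x n matrix
        "diagonal": m * n,            # one scalar per parameter
    }

counts = preconditioner_entries(512, 2048)
for name, entries in counts.items():
    gib = entries * 4 / 2**30         # assuming 4 bytes (float32) per entry
    print(f"{name:12s} {entries:>18,d} entries  {gib:12.4f} GiB")
```

For this single layer the full-matrix preconditioner needs 2^40 entries (4096 GiB in float32), whereas the two Shampoo statistics together need about 4.5 million entries.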

2.2 The Shampoo algorithm

As noted above, we describe the Shampoo algorithm for matrix-shaped parameter spaces, for brevity. All of our modifications to Shampoo are extended and implemented for tensors of arbitrary dimension.

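We describe Shampoo in the context of the Online Convex Optimization framework, which is closely related to (in fact, generalizes and extends) stochastic optimization (see, e.g., shalev2012online; hazan2016introduction). In Online Convex Optimization, learning progresses in rounds where on round $t$ the learner receives an input $X_t$ and then uses the matrix $W_t$ to form a prediction denoted $\hat{y}_t$. After making the prediction, the true outcome $y_t$ is revealed. The discrepancy between the true and predicted outcomes is assessed through a loss function $\ell$ which takes values in $\mathbb{R}_+$. The learner then uses the discrepancy to update the matrix to $W_{t+1}$ and prepare for the next round. For instance, the input on round $t$ can be an example $x_t \in \mathbb{R}^n$ for which the learner predicts $\hat{y} = f(W_t, x_t)$ where $f : \mathbb{R}^m \to \mathbb{R}$ and the loss is a function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ such as $\ell(\hat{y}, y) = (y - \hat{y})^2$ or $\ell(\hat{y}, y) = \log(1 + \exp(-y \hat{y}))$.

Stochastic gradient methods use the gradient $G_t = \nabla_W \ell(f(W, x_t), y_t)$, thus naturally $G_t \in \mathbb{R}^{m \times n}$ as the parameters are shaped as a matrix $W \in \mathbb{R}^{m \times n}$. The Shampoo algorithm tracks two statistics over the course of its run, $L_t$ and $R_t$, which are defined as follows,

$$L_t = \epsilon I_m + \sum_{s=1}^{t} G_s G_s^\top, \qquad R_t = \epsilon I_n + \sum_{s=1}^{t} G_s^\top G_s .$$

Note that $L_t \in \mathbb{R}^{m \times m}$, while $R_t \in \mathbb{R}^{n \times n}$. These matrices are used to precondition the gradient and update $W$, as follows:

$$W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4} .$$

The primary complexity of Shampoo arises from computing $L_t^{-1/4}$ and $R_t^{-1/4}$, which were originally computed using the singular value decomposition, an expensive operation.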
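A minimal NumPy sketch of one Shampoo step for a matrix-shaped parameter follows. It assumes the standard form of the algorithm: statistics L = εI + Σ G Gᵀ and R = εI + Σ Gᵀ G, and the update W ← W − η L^(−1/4) G R^(−1/4). The inverse roots are computed here with an eigendecomposition for clarity, which is exactly the expensive route discussed above, so this is a reference sketch rather than the distributed implementation. The toy quadratic objective and all names are illustrative.

```python
import numpy as np

def inverse_root(stat, p):
    """stat^(-1/p) for a symmetric positive definite matrix, via eigendecomposition."""
    d, u = np.linalg.eigh(stat)
    return (u * d ** (-1.0 / p)) @ u.T

def shampoo_step(w, g, l_stat, r_stat, lr):
    """One Shampoo update for a matrix parameter w with gradient g."""
    l_stat = l_stat + g @ g.T   # L_t = L_{t-1} + G_t G_t^T   (m x m)
    r_stat = r_stat + g.T @ g   # R_t = R_{t-1} + G_t^T G_t   (n x n)
    update = inverse_root(l_stat, 4) @ g @ inverse_root(r_stat, 4)
    return w - lr * update, l_stat, r_stat

# Toy problem: minimise 0.5 * ||W - target||_F^2, whose gradient is W - target.
rng = np.random.default_rng(0)
m, n, eps = 4, 3, 1e-4
target = rng.standard_normal((m, n))
w = np.zeros((m, n))
l_stat, r_stat = eps * np.eye(m), eps * np.eye(n)
initial_loss = 0.5 * np.sum((w - target) ** 2)
for _ in range(100):
    g = w - target
    w, l_stat, r_stat = shampoo_step(w, g, l_stat, r_stat, lr=1.0)
final_loss = 0.5 * np.sum((w - target) ** 2)
```

Only an m × m and an n × n matrix are ever factorized, never the full mn × mn matrix.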

2.3 Modern neural network training

Neural networks today are typically trained with mini-batch gradient descent. Modern accelerators such as GPUs and TPUs allow us to scale up neural network training by parallelizing the mini-batch forward and backward propagation calculations across many devices, commonly referred to as data-parallelism (dean2012). These devices have fast communication links between them to aggregate and broadcast the gradients. Moreover, the vast majority of the models trained today use the synchronous version of mini-batch gradient descent, where all devices coordinate to make the update, and see the same values of the parameters at every step. Parameters in the data parallel case are replicated across all the individual device memories. An alternate strategy is to divide up the parameters across several devices, where the placement policy takes into account the computational graph so that multiple parts of the models can run concurrently—this is referred to as model parallelism.

3 Full-matrix Preconditioning: Challenges

There were several challenges and design considerations in the development of the implementation of the distributed training system for Shampoo. These mainly arose from the fact that modern accelerators are highly optimized for training using first-order optimizers, which have low computational and memory requirements. The Shampoo algorithm is computationally expensive, and could become prohibitive for large models.

The extra overheads of Shampoo compared to standard first-order methods are in the following steps:

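  • Preconditioner statistics computation: $L_t = L_{t-1} + G_t G_t^\top$ and $R_t = R_{t-1} + G_t^\top G_t$

  • Inverse $p$'th root computation: $L_t^{-1/4}$ and $R_t^{-1/4}$

  • Preconditioned gradient computation: $L_t^{-1/4} G_t R_t^{-1/4}$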

As we will show later in Section 6, the computation of the second-order statistics and the preconditioned gradient does not add significantly to the runtime of each step. However, computing the inverse $p$'th roots is very slow (as much as 100 times the step time in some cases), and performing these computations without slowing down training was the main challenge in our system.

3.1 Algorithmic challenges

Large layers.

Modern ML architectures often use very large embedding layers, where the longer dimension can be in the millions. The Shampoo algorithm requires computing a preconditioner for each dimension, but neither computing nor storing a million-by-million matrix is feasible. We had to extend the algorithm to allow us to choose which dimensions to precondition. In addition, the very largest models occasionally have large fully connected layers. In Section 5 we show that partitioning a large tensor into smaller blocks and preconditioning each block is feasible, and does not impact accuracy significantly.

Delayed preconditioners.

As remarked above, computing the preconditioners is the most expensive computation in every Shampoo step. In Section 6 we show that we can compute the preconditioners once every few hundred steps without a significant effect on the accuracy, which indicates that the loss function landscape does not change significantly with each step.

3.2 Numerical challenges

Inverse $p$'th roots (where typically $p = 2, 4, 8$) can be computed using SVD, but there are efficient iterative algorithms such as the Schur-Newton algorithm (guo2006schur) that can compute the inverse $p$'th root as a sequence of matrix-vector and matrix-matrix products, which are highly optimized on modern accelerators. However, our experiments suggest that on real workloads the condition numbers of the $L_t, R_t$ matrices are very large (see Fig. 2), so both SVD and Schur-Newton must be run in double precision, which is very expensive on accelerators.
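One way such an iterative method can look is the coupled Newton iteration sketched below, which uses only matrix-matrix products. It is a sketch in the spirit of the Schur-Newton family and is not claimed to be the exact routine used in our system; the function name, tolerance, and iteration cap are illustrative assumptions. The iteration keeps the invariant X^p A = M and drives M to the identity, so that X converges to A^(−1/p). Everything runs in float64, matching the precision requirement described above.

```python
import numpy as np

def inverse_pth_root(a, p, tol=1e-10, max_iters=100):
    """Approximate a^(-1/p) for a symmetric positive definite matrix a."""
    identity = np.eye(a.shape[0], dtype=np.float64)
    a = a.astype(np.float64)                  # double precision throughout
    z = (1 + p) / (2 * np.linalg.norm(a, 2))  # scaling that keeps eig(M_0) in (0, p+1)
    m = z * a                                 # M_0 = z A
    x = z ** (1.0 / p) * identity             # X_0 = z^(1/p) I, so X_0^p A = M_0
    for _ in range(max_iters):
        if np.max(np.abs(m - identity)) < tol:
            break
        t = ((p + 1) * identity - m) / p
        x = x @ t                             # X_{k+1} = X_k T_k
        m = np.linalg.matrix_power(t, p) @ m  # M_{k+1} = T_k^p M_k, which tends to I
    return x

rng = np.random.default_rng(0)
b = rng.standard_normal((4, 6))
a = b @ b.T + 0.1 * np.eye(4)                 # well-conditioned test matrix
root = inverse_pth_root(a, p=4)
residual = np.max(np.abs(np.linalg.matrix_power(root, 4) @ a - np.eye(4)))
```

The number of iterations grows roughly with the logarithm of the condition number, which is one reason ill-conditioned statistics make this step costly.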

Figure 1: Benchmarks on computing inverse-pth root for statistics of varying dimensions. We find that the Schur-Newton iterative method can effectively utilize the CPUs and give large walltime improvements compared to SVD (that relies on bidiagonal divide-and-conquer). These were measured on Intel Skylake CPUs.
Figure 2: Condition number for $L_t$ of a layer in the transformer model over time.

3.3 Infrastructural challenges

Heterogeneous training hardware.

Neural network accelerators are custom designed to run machine learning workloads faster and at lower cost. Accelerator design is trending towards lower-precision (8-bit/16-bit) arithmetic, which satisfies both of these goals on existing benchmarks. Our method demands double-precision arithmetic as described above, which makes running the computation on accelerators a non-starter. We therefore had to design the system to leverage the existing underutilized CPUs of the training system, as described in Section 4.1.

API inflexibility.

Deep learning libraries such as TensorFlow (tensorflow) offer APIs for optimizer implementation that are well suited for first-order optimizers and for mini-batch training. Our design requires that we interact with the training loop in non-standard ways, which requires framework level changes. Our experiments were carried out using Lingvo (lingvo) and required changes to the training loop such as distributing computation to CPUs. We expect that this demonstration of the utility of full-matrix preconditioning will encourage the development of more flexible APIs to fully utilize heterogeneous hardware.

Memory available on training hardware.

Neural network accelerators typically have 8 to 32 GiB of on-board memory today. However, recent progress in natural language processing has demonstrated the value of inflating the model size from hundreds of millions to billions of parameters. For these large models the optimizer overhead of having auxiliary variables for gradient statistics and preconditioners can restrict training by forcing smaller mini-batch sizes, or prohibiting some optimizers altogether.

4 Distributed System Design

Our method is designed to run effectively on modern neural network accelerators such as TPUs (jouppi2017datacenter) or GPUs. We first describe the standard paradigm of data parallelism used in training models on these accelerators. Each core of the accelerator computes forward propagation and back propagation on a sub-batch (a subset of a mini-batch, which itself is a small randomly selected subset of the training set) of input examples, followed by gradient aggregation for computing the mini-batch gradient that requires an averaging of the gradients from all cores via all-reduction. The aggregated gradient is then used for weight updates. The forward propagation and back propagation are run in parallel across all cores available on the system.
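The gradient aggregation step can be checked on a toy problem. In the sketch below, each simulated core computes the gradient of a mean squared error on its own sub-batch, and a plain average stands in for the all-reduce. Under the assumption that all sub-batches have the same size, the averaged result equals the gradient on the full mini-batch. The loss and all names are illustrative.

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)^2) with respect to w."""
    return x.T @ (x @ w - y) / x.shape[0]

rng = np.random.default_rng(0)
num_cores, sub_batch, dim = 4, 8, 5
x = rng.standard_normal((num_cores * sub_batch, dim))
y = rng.standard_normal(num_cores * sub_batch)
w = rng.standard_normal(dim)

# Each simulated core sees one equally sized sub-batch of the mini-batch.
per_core = [grad_mse(w, xs, ys)
            for xs, ys in zip(np.split(x, num_cores), np.split(y, num_cores))]
all_reduced = np.mean(per_core, axis=0)  # stand-in for an all-reduce average
full_batch = grad_mse(w, x, y)           # gradient on the whole mini-batch
```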

All-reduction adds a barrier and all the cores synchronize to aggregate the mini-batch gradients from sub-batches and apply the weight update. In Fig. 3 we measure the overheads of each of the steps on a Transformer model (vaswani2017attention) described in the experiment section. We observe that the overheads from all-reduction and weight updates are a minor part (under 5%: 6ms of the 134ms step) of the overall step time.

Figure 3: Latency per step for a Transformer model with Diagonal AdaGrad optimizer is 134ms, with the breakdown: (a) forward prop: 57ms; (b) backward prop: 71ms; (c) all reduction: 4ms; and (d) weight updates: 2ms.

4.1 Exploiting the heterogeneity of the distributed training hardware

Figure 4: Timeline which illustrates the design of the optimization algorithm. Preconditioner statistics for all tensors ($L_t$ and $R_t$) are computed at each step. Preconditioners are only computed once every fixed number of steps (a few hundred in our experiments), and this computation is distributed to all CPU cores available in the training system. The operations are pipelined such that overheads are amortized.

The overall design of our implementation is illustrated by the timeline in Fig. 4. As discussed in the previous section, the preconditioner computation (inverse $p$'th root) is expensive and requires double precision. Here we exploit the heterogeneity in the distributed training hardware by utilizing a key resource that is typically available in distributed training architectures: the central processing units (CPUs) on the machines to which accelerators such as GPUs or Cloud TPUs are attached. These CPUs are responsible for gathering and processing training data, and for auxiliary activities such as check-pointing and summarization of training state. They are often idle or at low utilization while the accelerator is running the training loop, and they offer double-precision arithmetic natively, which makes them a perfect choice to run the preconditioner computation without adding any extra cost to the training run.

As mentioned above, we run the preconditioner computation every few hundred steps and make use of a stale version of the preconditioner in the training loop until a fresher version of the preconditioner becomes available, see Fig. 9 for empirical justification. Moreover, the computation is pipelined and runs asynchronously without blocking the training loop. As preconditioners need to be computed for every layer of the network, we distribute the computation across all the CPUs that are part of the training system. As a result, the most expensive step in Shampoo adds almost nothing to the overall training time!
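The control flow of this pipelining can be sketched with a thread pool standing in for the host CPUs. This is a toy model under stated assumptions, not our training system: the statistics are a scalar, `compute_preconditioner` is a hypothetical stand-in for the double-precision inverse root, and at most one root computation is in flight. What it shows is that the training loop never blocks: it keeps using the stale preconditioner and swaps in a fresher one whenever a result is ready.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_preconditioner(stat):
    """Stand-in for the expensive inverse root; here a scalar stat^(-1/4)."""
    return stat ** -0.25

def train(num_steps, refresh_every):
    stat = 1.0      # scalar stand-in for the accumulated statistics
    precond = 1.0   # identity preconditioner until the first result arrives
    pending = None  # at most one root computation in flight
    applied = 0     # how many fresh preconditioners were picked up
    weight = 0.0
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        for step in range(1, num_steps + 1):
            grad = 1.0                      # placeholder gradient
            stat += grad * grad             # statistics are updated every step
            if pending is not None and pending.done():
                precond = pending.result()  # swap in the fresher version
                pending, applied = None, applied + 1
            if step % refresh_every == 0 and pending is None:
                pending = cpu_pool.submit(compute_preconditioner, stat)
            weight -= 0.01 * precond * grad  # step uses the possibly stale value
    return precond, applied

precond, applied = train(num_steps=1000, refresh_every=100)
```

How many refreshes are picked up depends on thread scheduling, which mirrors the real system: the interval between refreshes has to be long enough for the background computation to finish.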

5 Algorithmic Modifications

We now describe two simple enhancements to Shampoo that are critical to make it practical for large models.

5.1 Decoupling the step size and the direction

We empirically observed that Shampoo updates give directions that are superior to diagonal AdaGrad, but the per-tensor (“layer-parameters”) scale of the learning rates caused numerical and training instabilities. Our solution is to run diagonal AdaGrad, which is inexpensive to compute, in parallel. We derive the learning rate for the update of each tensor by ensuring it is on par with the diagonal counterpart. Concretely, the weight matrix $W_t$ is updated as follows,

$$D_t = \sum_{s=1}^{t} G_s \bullet G_s, \qquad \eta_t = \eta_0 \frac{\| D_t^{\odot -1/2} \bullet G_t \|_F}{\| L_t^{-1/4} G_t R_t^{-1/4} \|_F}, \qquad W_{t+1} = W_t - \eta_t L_t^{-1/4} G_t R_t^{-1/4},$$

where $D^{\odot \alpha}$ is the element-wise power, $(D^{\odot \alpha})_{ij} = D_{ij}^{\alpha}$. In words, we first compute diagonal statistics. We then set the learning rate to be the ratio of the norms of the preconditioned gradients according to diagonal AdaGrad and Shampoo. This ensures that the learning rate would be on par with that of diagonal AdaGrad. Last, we update the parameters using Shampoo with the (automatically) rescaled learning rate.
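The rescaling can be written in a few lines of NumPy. The sketch below assumes the form described in words above: the Shampoo direction L^(−1/4) G R^(−1/4) is multiplied by the ratio of the Frobenius norm of the diagonal AdaGrad direction to its own norm. The small `eps` added to the diagonal statistics is our assumption to avoid division by zero, and all names are illustrative.

```python
import numpy as np

def inverse_root(stat, p):
    """stat^(-1/p) for a symmetric positive definite matrix."""
    d, u = np.linalg.eigh(stat)
    return (u * d ** (-1.0 / p)) @ u.T

def grafted_update(g, l_stat, r_stat, d_stat, lr0):
    """Shampoo direction rescaled to the norm of the diagonal AdaGrad step."""
    adagrad_dir = g / np.sqrt(d_stat)  # element-wise D^(-1/2) times G
    shampoo_dir = inverse_root(l_stat, 4) @ g @ inverse_root(r_stat, 4)
    lr = lr0 * np.linalg.norm(adagrad_dir) / np.linalg.norm(shampoo_dir)
    return lr * shampoo_dir, lr0 * adagrad_dir

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 3))
eps = 1e-6                             # assumption: small ridge on all statistics
l_stat = eps * np.eye(4) + g @ g.T
r_stat = eps * np.eye(3) + g.T @ g
d_stat = eps + g * g
shampoo_step, adagrad_step = grafted_update(g, l_stat, r_stat, d_stat, lr0=0.1)
```

By construction the two returned steps have the same Frobenius norm; only the direction differs.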

5.2 Preconditioning large tensors

Each preconditioning step of Shampoo requires $O(n^3)$ time where $n$ is the largest dimension of the tensor. While this is better than full-matrix AdaGrad, for large layers such as the embedding and softmax layers in language models, it is intractable to compute Shampoo preconditioners. These layers typically are of the shape vocabulary-size $\times$ embedding-dimension. For instance, for our machine translation experiments we use a vocabulary size of 32000 and 512 embedding dimensions. There is a large storage and computational cost associated with the preconditioner along the vocabulary axis. For example, computing the inverse $p$'th root of a $32000 \times 32000$ matrix would take hours.

In order to retain the benefits of preconditioning for these layers, we bypass preconditioning of excessively large dimensions. The following result allows us to use any subset of preconditioners as long as their exponents sum up to $-1/2$.

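Lemma 1.

Assume that $G_1, \ldots, G_t \in \mathbb{R}^{m \times n}$ are matrices of rank at most $r$. Let $g_s = \mathrm{vec}(G_s)$ and define

$$\hat{H}_t = \epsilon I_{mn} + \sum_{s=1}^{t} g_s g_s^\top .$$

Let $L_t, R_t$ be defined as above,

$$L_t = \epsilon I_m + \sum_{s=1}^{t} G_s G_s^\top, \qquad R_t = \epsilon I_n + \sum_{s=1}^{t} G_s^\top G_s .$$

Then, the following properties hold:

  (1) $\hat{H}_t \preceq r\, L_t \otimes I_n$ and $\hat{H}_t \preceq r\, I_m \otimes R_t$;

  (2) for any $p, q > 0$ such that $1/p + 1/q = 1$, we have $\hat{H}_t \preceq r\, L_t^{1/p} \otimes R_t^{1/q}$.

Both inequalities of (1) were proven as part of Lemma 8 of shampoo-icml. By using Ando’s inequality (ando2004geometric), we get

$$\hat{H}_t \preceq r\, (L_t \otimes I_n)^{1/p} (I_m \otimes R_t)^{1/q} = r\, (L_t^{1/p} \otimes I_n)(I_m \otimes R_t^{1/q}) = r\, L_t^{1/p} \otimes R_t^{1/q},$$

which concludes the proof.

An immediate consequence is that for any $p, q > 0$ such that $1/p + 1/q = 1$, we can employ preconditioning of the form

$$\tilde{G}_t = L_t^{-1/2p}\, G_t\, R_t^{-1/2q} .$$

Further, by choosing $(p, q) = (1, \infty)$ and $(p, q) = (\infty, 1)$ we obtain the simple preconditioned gradients,

$$L_t^{-1/2} G_t \qquad \text{and} \qquad G_t R_t^{-1/2},$$

respectively.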

Our choice is empirically supported by the experiments shown in Fig. 7 which suggest that there is a benefit from preconditioning the large softmax and embedding layers with minimal increase in time as shown in Fig. 5.
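For a layer laid out as vocabulary-size × embedding-dimension, skipping the vocabulary axis means that only the small statistic on the embedding axis is ever formed. The sketch below assumes that layout and the one-sided form G R^(−1/2). The sizes are small stand-ins for 32000 × 512, a single gradient is used to form the statistic, and the names are illustrative.

```python
import numpy as np

def right_only_precondition(g, r_stat):
    """Return G R^(-1/2); the (vocab x vocab) left statistic is never formed."""
    d, u = np.linalg.eigh(r_stat)
    return g @ ((u * d ** -0.5) @ u.T)

rng = np.random.default_rng(0)
vocab, dim, eps = 2000, 16, 1e-6      # small stand-ins for 32000 x 512
g = rng.standard_normal((vocab, dim))
r_stat = eps * np.eye(dim) + g.T @ g  # dim x dim: cheap to store and to invert
update = right_only_precondition(g, r_stat)
gram = update.T @ update              # close to the identity for this one-step statistic
```

The only matrix that is factorized has side length `dim`, so the cost is independent of the vocabulary size apart from the two matrix products with `g`.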

5.3 Preconditioning blocks from large tensors

To reduce the computational cost of computing statistics and preconditioned gradients, a natural extension is to divide the tensor into blocks and treat each block as a separate tensor. Concretely, this would entail dividing a tensor $W \in \mathbb{R}^{km \times kn}$ into $W_{1,1}, \ldots, W_{m,n}$ such that $W_{i,j} \in \mathbb{R}^{k \times k}$ for all $i, j$. We ran experiments that partition intermediate layers into blocks, which we observe empirically to have minimal impact on the quality of the solution while providing faster step time (Fig. 8).
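A sketch of the blocking idea, under simplifying assumptions: the gradient is split along its column axis only, each block gets its own pair of statistics built from the current gradient alone, and the preconditioned blocks are concatenated back together. The block size, the `eps` ridge, and the function names are illustrative and not taken from our implementation.

```python
import numpy as np

def inverse_root(stat, p):
    """stat^(-1/p) for a symmetric positive definite matrix."""
    d, u = np.linalg.eigh(stat)
    return (u * d ** (-1.0 / p)) @ u.T

def precondition(blk, eps):
    """Two-sided preconditioning of one block using single-step statistics."""
    l_stat = eps * np.eye(blk.shape[0]) + blk @ blk.T
    r_stat = eps * np.eye(blk.shape[1]) + blk.T @ blk
    return inverse_root(l_stat, 4) @ blk @ inverse_root(r_stat, 4)

def blocked_precondition(g, block_cols, eps=1e-6):
    """Precondition each column block of g as if it were its own tensor."""
    pieces = [precondition(g[:, start:start + block_cols], eps)
              for start in range(0, g.shape[1], block_cols)]
    return np.concatenate(pieces, axis=1)

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 32))                # stand-in for a [512, 2048] layer
update = blocked_precondition(g, block_cols=8)  # four [8, 8] blocks
```

Each inverse root now acts on a matrix whose side is the block size, so the cubic cost applies to the block dimension rather than the full layer dimension.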

6 Experiments

We compare our method against various well-known optimization algorithms for training large state-of-the-art deep models on several domains. Code and details on hyper-parameter tuning are provided in the supplementary material.

6.1 Machine Translation with a Transformer

We demonstrate the effectiveness of our implementation on the standard machine translation dataset from WMT’14 English to French (enfr) with 36.3M sentence pairs. We used the state-of-the-art Transformer architecture (vaswani2017attention). This architecture contains 93.3M parameters and consists of 6 layers for its encoder and decoder. Each layer is composed of 512 model dimensions, 2048 hidden dimensions, and 8 attention heads. The model makes use of a sub-word vocabulary that contains 32K word pieces (schuster12). The experiment was run on 32 cores of a Cloud TPU v3 Pod, and the optimizer was implemented in the Lingvo (lingvo) sequence-to-sequence modeling framework, which is based on TensorFlow. Our results are shown in Fig. 6: our algorithm achieves the same accuracy as AdaGrad or Adam in about half as many steps.

Figure 5: Detailed breakdown of latency of a single step. Diagonal AdaGrad optimizer: 134ms, our implementation of Shampoo: 145ms (for all layers except embedding and softmax layers) and 155ms (for all layers). As preconditioner computation is pipelined and distributed over CPUs it does not add any overhead, and transfer latency is minimal (100ms) and is amortized over hundreds of steps.
Figure 6: Test log-perplexity of a Transformer model on WMT’14 enfr, trained with batch size of 1536 on 32 cores of a Cloud TPU v3 Pod training system. The algorithm converges 1.95x faster in steps, while being only about 16% slower per step (155ms vs. 134ms; see Fig. 5). This allows the method to attain a particular log-perplexity in 40% less wall-clock time.
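The wall-clock figure follows from the step counts and step times reported in Figs. 5 and 6. The short calculation below makes the arithmetic explicit; the function name is ours and the inputs are the numbers quoted in the captions.

```python
def wall_clock_saving(step_speedup, baseline_ms, method_ms):
    """Fractional wall-clock saving when a method needs fewer but slower steps."""
    slowdown = method_ms / baseline_ms  # per-step cost ratio
    return 1.0 - slowdown / step_speedup

saving = wall_clock_saving(step_speedup=1.95, baseline_ms=134.0, method_ms=155.0)
per_step_slowdown = 155.0 / 134.0 - 1.0
print(f"per-step slowdown: {per_step_slowdown:.1%}, wall-clock saving: {saving:.1%}")
```

With a 1.95x reduction in steps and a step that takes 155ms instead of 134ms, the saving comes out at roughly 40%.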

Preconditioning of embedding and softmax layers:

Following the first methodology discussed in Section 5.2, the algorithm preconditions the large layers with only one of the preconditioners ($G_t R_t^{-1/2}$ or $L_t^{-1/2} G_t$) to make it tractable. Fig. 5 shows that the increase in step time is only 6%, while Fig. 7 shows that we can reduce the number of steps to convergence by roughly 20%.

Figure 7: Test log-perplexity of Transformer model on WMT’14 enfr with a per-core batch size of 48 (overall batch size of 1536) trained on 32 cores of a Cloud TPU v3 Pod, with preconditioning applied to all layers except the embedding and softmax layers vs. applied to all layers. We notice a large improvement in convergence in terms of number of steps with only a small 6% increase in step time.

Reducing overhead in fully-connected layers:

Following the block-partitioning approach discussed in Section 5.3, we ran two experiments where we partitioned a fully connected layer of size [512, 2048] into two blocks of size [512, 1024] and four blocks of size [512, 512]. Our experiments show no drop in quality under this approximation, with a small reduction in runtime.

Figure 8: Test log-perplexity of Transformer model on WMT’14 enfr with a per-core batch size of 48 (overall batch size of 1536) trained on 32 cores of a Cloud TPU v3 Pod with fully connected layers partitioned into sub-blocks.

Effect of delayed computation of preconditioners:

The frequency of preconditioner updates is a tunable parameter in our design that trades off step-time performance against solution quality. Experimentation on this tunable parameter revealed that our method can tolerate delays of up to 1200 steps without any noticeable quality loss; see Fig. 9.

Figure 9: Solution quality at varying interval between preconditioner updates. The method can tolerate intervals of up to 1200 steps without any loss of quality.
Figure 10: Illustration of the rich structure in the preconditioning matrices for a Transformer model: (a) and (b) preconditioners of an attention layer; (c) preconditioner of the embedding layer. (Color intensities are in log scale.)

On the sign changes in the preconditioned gradient:

We visualize the preconditioners of the attention and embedding layers in Fig. 10, where we see rich structure that is not diagonal (i.e., not axis-parallel). The resulting preconditioned gradient is rotated and scaled (instead of just the axis-parallel scaling of first-order adaptive methods). We also found that on average 30% of all the coordinates change sign when comparing the preconditioned gradient with the gradient.

Learning rates schedules:

For the Transformer experiments, we fixed the warmup schedule as well as the decay schedules for Adam. For the smaller Transformer experiments, we tuned the hyperparameters for each of the algorithms over 100 trials. We took the best settings found for the momentum and second-moment parameters, and tuned the learning rates until either the model became unstable or performance stopped improving. As Shampoo uses layer-wise learning rate scales from AdaGrad, we found that for the exact same hyperparameter settings, Shampoo provides a modest improvement in performance. Moreover, Shampoo allows for larger learning rates than AdaGrad does, as shown in Fig. 11.

Figure 11: Test log-perplexity of a Transformer-Big model on WMT’14 enfr, trained with batch size of 384 on 32 cores of a Cloud TPU v3 Pod training system. We demonstrate that Shampoo performs better than AdaGrad with the same hyper-parameter setting and allows for larger learning rates, and in turn for further improvements in convergence.

6.2 Transformer-Big model

We also ran experiments with a larger Transformer model. This model contains 375.4M parameters and consists of 6 layers for its encoder and decoder. Each layer is composed of 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads. Results are presented in Fig. 12 where again we see an improvement in the end-to-end wall-clock time. For the softmax, embedding and the projection fully-connected layer (with 8192 hidden dimensions) we only make use of the left preconditioner. We note that step time is dominated by the preconditioned gradient computation which can be reduced by sub-blocking the layers. However, we ran into a compiler limitation due to the increased number of nodes; we will address this in future work.

Figure 12: Test log-perplexity of a Transformer-Big model on WMT’14 enfr, trained with batch size of 384 on 32 cores of a Cloud TPU v3 Pod training system. The algorithm converges roughly 2x faster in steps while being about 40% slower per step, which allows the method to attain a particular log-perplexity in 30% less wall-clock time.

On the overhead of the optimizer:

We capture the computational and memory complexity of the various schemes for handling large layers described in Section 5.2 in Table 1. We note that the overhead from computing the statistics, as well as from computing the preconditioned update for a single step of training, can be further reduced by increasing the batch size (indeed, these overheads are independent of the batch size), as shown in Fig. 13, where the overhead drops from 40% to 19%.

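Type                                          Computation            Memory
All preconditioners, $W_t: [n, m]$            $O(n^2 m + m^2 n)$     $O(n^2 + m^2)$
Left-only preconditioner for $W_t: [n, m]$    $O(n^2 m)$             $O(n^2)$
Preconditioner, block size $b$                $O(mnb)$               $O(mn)$
Table 1: Computational and memory complexity of variants of Shampoo.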
Figure 13: Test log-perplexity of a Transformer-Big model on WMT’14 enfr, trained with batch size of 1536 on 32 cores of a Cloud TPU v3 Pod training system. The increased batch size reduces the optimizer overhead from 40% to 19%, with a x improvement in steps to quality, the overall reduction in wall-time improves from 30% at batch size 384 to at batch size 1536.

6.3 Image Classification

Finally, we trained a ResNet-50 model (resnet) on the ImageNet-2012 dataset and compared it against the state-of-the-art baseline using SGD+Momentum. Models were trained at a batch size of 4096 and for 90 epochs with L2 regularization of and label smoothing . The learning rate was warmed up over the first 5 epochs, followed by a decay schedule in which the learning rate is reduced by a factor of 10 at {30, 60, 90} epochs.

Our results are presented in Table 2 and Fig. 14. We find that Shampoo does not provide any improvement in test loss or accuracy. However, our method reduces the training loss faster than a well-tuned SGD+Momentum baseline. The worse generalization of adaptive algorithms has also been discussed in GGT and requires additional regularization. We leave this as future work, as part of our goal to analyze the interplay between architectural choices and preconditioning.

Figure 14: Train (left) and test (right) cross-entropy on ImageNet-2012 with ResNet-50.

Optimizer       Top-1 Accuracy   Top-5 Accuracy
SGD+Momentum    76.43            93.23
Shampoo         75.25            92.27
Adagrad         73.72            91.55
Table 2: Test results on ImageNet with ResNet-50 trained at batch size 4096.

7 Conclusion

We presented a practical implementation of the Shampoo second-order algorithm. On a state-of-the-art Transformer model, our method reduces the overall wall-clock time by up to 40% compared to the fastest first-order methods. Our future work will focus on understanding the interplay between architecture choices, regularization, and preconditioning. We suspect that a joint search over these hyperparameters could reveal insights that allow us to build more efficient networks.


Appendix A Implementation Details of Shampoo

Our implementation of the Shampoo algorithm for fully-connected layers is described in Algorithm I. The algorithm can use heavy-ball momentum for its updates, as well as an exponential moving average over the preconditioners, like Adam. The configuration parameter denotes the number of steps between subsequent fetches of the latest available preconditioner by the accelerator. The parameter must be set sufficiently high so that there is enough time for the CPU to complete the computation of the preconditioner asynchronously and pipeline it efficiently, but otherwise its setting does not have a significant effect on convergence.

1:parameters: learning rate , momentum: ,
2:for   do
3:     Receive stochastic gradients for each layer
4:     if  then
5:          +
6:          +
7:     else
8:          +
9:          +      
12:     if  then
13:         Gather preconditioners from CPUs
14:         Send to CPU host to compute      
15:     if  then
19:     else
Algorithm I Practical Shampoo
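
For intuition, the core Shampoo update for a single fully-connected layer can be sketched in NumPy. This is a minimal sketch of the standard Shampoo recursion, not a transcription of Algorithm I: it omits momentum and the exponential moving average, computes the inverse roots synchronously rather than on a separate CPU host, and takes a plain gradient step until the first refresh. The names `inverse_pth_root`, `ShampooSketch` and `fetch_every` are hypothetical and chosen for illustration only.

```python
import numpy as np


def inverse_pth_root(mat, p, eps=1e-6):
    """Return (mat + eps*I)^(-1/p) for a symmetric PSD matrix."""
    n = mat.shape[0]
    eigvals, eigvecs = np.linalg.eigh(mat + eps * np.eye(n))
    eigvals = np.maximum(eigvals, eps)  # guard against round-off
    # Scale each eigenvector column by its eigenvalue raised to -1/p.
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


class ShampooSketch:
    """Simplified Shampoo for one weight matrix of shape (n, m)."""

    def __init__(self, shape, lr=0.1, fetch_every=10, eps=1e-6):
        n, m = shape
        self.lr = lr
        self.fetch_every = fetch_every  # steps between preconditioner refreshes
        self.eps = eps
        self.L = np.zeros((n, n))       # left statistics: sum of G G^T
        self.R = np.zeros((m, m))       # right statistics: sum of G^T G
        self.L_inv = np.eye(n)          # identity until the first refresh
        self.R_inv = np.eye(m)
        self.step_count = 0

    def step(self, W, G):
        """Accumulate statistics from gradient G and return updated weights."""
        self.step_count += 1
        self.L += G @ G.T
        self.R += G.T @ G
        if self.step_count % self.fetch_every == 0:
            # In a distributed setup this refresh would run asynchronously;
            # here it is done inline for simplicity.
            self.L_inv = inverse_pth_root(self.L, 4, self.eps)
            self.R_inv = inverse_pth_root(self.R, 4, self.eps)
        return W - self.lr * (self.L_inv @ G @ self.R_inv)
```

Because the preconditioners are refreshed only every `fetch_every` steps, the update between refreshes uses stale inverse roots, which mirrors the role of the fetch interval described above.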

Appendix B Further Details on Experiments

Experiment Optimizer Batch Optimizer Parameters Warmup
Transformer Adam 1536 , , 40k steps
Adagrad 1536 , 40k steps
Shampoo 1536 , , , 40k steps
Transformer-Big Adam 384 , , 40k steps
Adagrad 384 , 40k steps
Shampoo 384 , , , 40k steps
Transformer-Big Adagrad 1536 , 40k steps
Shampoo 1536 , , , 40k steps
ResNet-50 SGD 4096 (staircase) , 5 epochs
AdaGrad 4096 (staircase) , 5 epochs
Shampoo 4096 (staircase) , 5 epochs
, ,
Table 3: Hyperparameter setup used in our experiments.

B.1 Transformer model on WMT’14 en→fr

For all optimizers, we make use of a warmup schedule where the learning rate is increased from 0.0 to over 40k steps. For the smaller transformer experiments, we use a quadratic warmup, and for the larger transformer experiments we use a linear warmup. We found that quadratic warmup improves all optimizers equally and provides a better log-perplexity. For the Adam optimizer experiments, we use a learning rate decay schedule of the form , following the suggestion of vaswani2017attention.
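
The two warmup shapes can be sketched as follows. The exact functional forms are assumptions: we take linear warmup to scale the peak learning rate by t/T and quadratic warmup by (t/T)^2, with T = 40k steps. The function `warmup_lr` and its `peak_lr` argument are hypothetical, and any decay applied after warmup is not modeled.

```python
def warmup_lr(step, peak_lr, warmup_steps=40_000, kind="linear"):
    """Learning rate during warmup, held at peak_lr afterwards.

    Assumed forms: linear -> peak_lr * (t / T), quadratic -> peak_lr * (t / T)**2.
    """
    frac = min(max(step, 0) / warmup_steps, 1.0)
    if kind == "linear":
        return peak_lr * frac
    if kind == "quadratic":
        return peak_lr * frac ** 2
    raise ValueError(f"unknown warmup kind: {kind!r}")


# Halfway through warmup, quadratic warmup is at a quarter of the peak,
# while linear warmup is at half of it.
print(warmup_lr(20_000, 1.0, kind="linear"))
print(warmup_lr(20_000, 1.0, kind="quadratic"))
```

Under these assumed forms, quadratic warmup keeps the learning rate lower for longer at the start of training and ramps up more steeply towards the end of the warmup period.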

B.2 ResNet-50 on ImageNet

For SGD with Momentum, the learning rate is warmed up over the first 5 epochs from 0 to 1.6, followed by 10x drops of the learning rate at 30, 60 and 80 epochs. For AdaGrad and Shampoo, we change the peak learning rate to 0.3075 and 0.64, respectively, and weight decay of but follow the same staircase decay scheme as SGD with Momentum. For all optimizers, we grid search the L2 regularization parameter in the range and the peak learning rate between 0.016 and 16.
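
The staircase schedule for the SGD with Momentum baseline can be sketched as below. The function `resnet_lr` is a hypothetical helper, and the linear shape of the warmup is an assumption, since only its start value, end value and duration are stated above.

```python
def resnet_lr(epoch, peak_lr=1.6, warmup_epochs=5,
              boundaries=(30, 60, 80), factor=0.1):
    """Warmup followed by staircase decay, as a function of (fractional) epoch.

    The warmup is assumed to be linear from 0 to peak_lr; after it, the rate
    is multiplied by `factor` once for each boundary already reached.
    """
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    drops = sum(epoch >= b for b in boundaries)
    return peak_lr * factor ** drops


for e in (0, 2.5, 5, 30, 60, 80):
    print(e, resnet_lr(e))
```

The same function covers the AdaGrad and Shampoo runs by passing their peak learning rates (0.3075 and 0.64) as `peak_lr`.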