1 Introduction
Large-scale training of deep learning models with up to hundred-billion parameters is computation-heavy and expensive (Brown et al., 2020). In addition, another serious system challenge is the communication overhead. A recent study of BERT pretraining with Adam demonstrates that the allreduce communication can take up to 94% and 75% of total training time per step on clusters with Ethernet and InfiniBand inter-node connections, respectively (Tang et al., 2021). To achieve communication-efficient distributed training, there are two promising directions: large batch optimization and communication compression. The LAMB optimizer, which can be viewed as Adam with adaptive layerwise learning rates, is an example of large batch optimization (You et al., 2020). LAMB can scale the batch size of BERT pretraining to 64K without losing accuracy, thereby greatly reducing the total training time as larger batches require fewer communications. On the other hand, recent work on communication compression such as 1bit Adam demonstrates that it is possible to combine 1bit compression with Adam's convergence speed, thereby greatly reducing the BERT pretraining communication volume (Tang et al., 2021).
Both LAMB and 1bit Adam demonstrate great benefits for distributed training. Unfortunately, our studies show that simply using one of them is not sufficient to fully address the communication issue, especially under limited network bandwidth and a large number of GPUs/machines. We find that communication is still a nontrivial overhead when running large-scale distributed training with LAMB, even with the larger batch sizes. 1bit Adam has the same convergence speed as Adam (Tang et al., 2021), and a previous study shows that Adam provides slower convergence compared to LAMB at batch sizes 16K or larger for BERT pretraining (You et al., 2020). Using You et al. (2020)'s methodology, our BERT pretraining experiments show that 1bit Adam, similar to Adam, also converges more slowly than LAMB at batch size 16K. Even with communication compression, this batch size limitation hurts communication efficiency when the number of GPUs/machines is large.
LAMB and 1bit Adam are two distinct optimizers. However, the techniques behind them are complementary: large batch optimization reduces the number of communications, and compression reduces the communication volume. Motivated by this, we aim to combine LAMB's large batch optimization algorithm with the compression strategies behind 1bit Adam. However, we find that they are not directly compatible due to LAMB's unique layerwise learning rate update strategy, which requires information that is missing when communication (and optimizer state) is compressed.
The studies and challenges above motivate us to design a new algorithm called 1bit LAMB. Learning from the insights behind 1bit Adam, 1bit LAMB is a two-stage algorithm which uses LAMB (warmup stage) to “precondition” a communication-compressed momentum SGD algorithm (compression stage). At the compression stage, where the original LAMB algorithm cannot be used to update the layerwise learning rates, 1bit LAMB employs a novel way to adaptively scale the layerwise learning rates based on information from both the warmup and compression stages. As a result, 1bit LAMB is able to achieve LAMB's large batch convergence speed under compressed communication, which is impossible using existing approaches.
We implement the 1bit LAMB algorithm using the existing MPI-based compressed communication backend (introduced in the 1bit Adam work), and a new NCCL-based compressed communication backend (introduced in this paper) which provides better usability and performance. We evaluate 1bit LAMB's convergence and performance on BERT pretraining and GLUE/SQuAD finetuning tasks. Results show that across a range of batch sizes and with up to 256 GPUs, 1bit LAMB with the NCCL-based backend achieves significant communication volume reduction and end-to-end speedup (in terms of number of training samples per second) for BERT pretraining compared to uncompressed LAMB, together with the same convergence speed (in terms of number of pretraining samples needed to reach the same accuracy on GLUE/SQuAD finetuning tasks).
We make the following contributions:

We conduct an extensive study of distributed training with LAMB and 1bit Adam, which provides detailed insights behind the algorithms and motivates our work. (Section 3)

We propose a new algorithm, 1bit LAMB, a communication-efficient momentum SGD algorithm preconditioned with the LAMB optimizer, which to the best of our knowledge is the first work that combines communication compression and large batch optimization. (Section 4)

We implement a custom collective communication primitive using the NCCL backend of PyTorch distributed, which provides better usability and performance than existing solutions. This backend can be applied to 1bit LAMB, 1bit Adam, and potentially other communication compression algorithms. (Section 5)

We conduct largescale convergence and performance experiments on BERT pretraining and GLUE/SQuAD finetuning, which demonstrate 1bit LAMB’s superior performance and competitive convergence speed compared to LAMB. (Section 6)

The 1bit LAMB optimizer as well as the NCCL-based communication backend have been open sourced in a deep learning optimization library called DeepSpeed (https://github.com/microsoft/DeepSpeed, https://www.deepspeed.ai/).
2 Related Work and Background
2.1 Communication-efficient distributed training
To achieve communication-efficient distributed training, techniques include decentralization (Lian et al., 2017; Koloskova* et al., 2020; Li et al., 2018), asynchronous communication (Zheng et al., 2016; Chaturapruek et al., 2015), and gradient compression/quantization, which we focus on in this paper. Before communication, we can compress the original gradient g into C(g), where C(·) is the compression operator (which may include randomness). As a result the communication volume can be greatly reduced. Compression can be achieved by quantization, sparsification, sketching, etc. (Ye and Abbe, 2018; Alistarh et al., 2017; Agarwal et al., 2018; Yu et al., 2019; Spring et al., 2019; Ivkin et al., 2019; Shi et al., 2021).
Several works focus on unbiased compression methods (original and compressed tensors have the same expectation), such as centralized compressed parallel SGD (Alistarh et al., 2017) and many others (Wangni et al., 2018; Shen et al., 2018; Zhang et al., 2017; Wen et al., 2017; Jiang and Agrawal, 2018). On the other hand, recent works on biased compression methods demonstrate better compression rates and the same convergence rate by using an error compensation technique (Seide et al., 2014; Bernstein et al., 2019; Stich et al., 2018; Zheng et al., 2019; Phuong and Phong, 2020; Yu et al., 2019; Shi et al., 2019; Ivkin et al., 2019; Sun et al., 2019; Basu et al., 2019; Vogels et al., 2019; Tang et al., 2021). The idea of using error compensation for compression was proposed in the 1bit SGD work (Seide et al., 2014): instead of compressing the gradient at each iteration directly, they compress the sum of the gradient and the last step's compression error. They find that by using error compensation the training can achieve promising convergence speed even with 1bit compression (representing the gradient by signs and a scale). Recent works provide theoretical guarantees for this method (Bernstein et al., 2019), and also demonstrate that it admits the same asymptotic convergence rate as the uncompressed case (Stich et al., 2018). In addition, the error compensation method enables almost any compression method (Stich et al., 2018), either biased or unbiased, to converge as fast as the uncompressed case, and it is compatible with other techniques including decentralized training (Vogels et al., 2020), local SGD (Xie et al., 2020), and accelerated algorithms (Gorbunov et al., 2020).
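To make the error-feedback loop above concrete, here is a minimal single-tensor sketch (function names are ours; the 1bit compressor shown uses the mean magnitude as its single scale, as in 1bit SGD/1bit Adam):

```python
import numpy as np

def onebit_compress(x):
    """1-bit compression: keep only the signs plus one global scale
    (the mean magnitude of the tensor)."""
    scale = np.mean(np.abs(x))
    return scale * np.sign(x)

def compress_with_error_feedback(grad, error):
    """Error-compensated compression: compress the sum of the gradient
    and the previous step's compression error, and remember the new
    residual for the next step."""
    corrected = grad + error
    compressed = onebit_compress(corrected)
    new_error = corrected - compressed
    return compressed, new_error
```

Because the residual is re-injected at every step, the compression error does not accumulate over the course of training, which is the key to matching the uncompressed convergence rate.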
2.2 Adam, 1bit Adam, and LAMB
Influenced by previous works that study adaptive learning rates (Adagrad (Duchi et al., 2011), RMSprop (Tieleman and Hinton, 2011), Adadelta (Zeiler, 2012), etc.), Adam (Kingma and Ba, 2015) can be viewed as SGD with momentum and adaptive learning rate scaling on each coordinate of the gradient. It has demonstrated promising convergence speed and hyperparameter robustness on many deep learning tasks. Adam's update rule can be summarized as (for simplicity weight decay is omitted):

m_t = β₁ m_{t−1} + (1 − β₁) g_t,
v_t = β₂ v_{t−1} + (1 − β₂) (g_t)²,
x_{t+1} = x_t − γ m_t / (√v_t + η),   (1)

where g_t is the stochastic gradient at step t, m_t is the momentum, v_t is the second moment of the gradient (i.e., the variance), β₁ and β₂ are the decaying factors, x_t is the model, γ is the learning rate, and η is an additive constant that ensures that we do not divide by 0. Here the square, square root, and division all denote elementwise operations. Recently, Tang et al. (2021) proposed 1bit Adam, which combines the efficiency of error-compensated 1bit compression with Adam's convergence speed. Their study shows that error-compensated compression does not work for Adam directly, because Adam is nonlinearly dependent on the gradient (through the variance in (1)), which affects the error compensation mechanism. On the other hand, they find that Adam's variance becomes stable at an early stage of training. To this end, they design a new two-stage algorithm, 1bit Adam, which uses Adam (warmup stage) to “precondition” a communication-compressed momentum SGD algorithm (compression stage). At the warmup stage, vanilla Adam is used. At the compression stage, they stop updating the variance and use it as a fixed preconditioner, and communicate the momentum with error-compensated 1bit compression. Their experiments on up to 256 GPUs show that 1bit Adam achieves the same convergence behaviour and final accuracy as Adam, together with greatly reduced communication volume and faster end-to-end throughput.
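The update rule (1) can be sketched directly (a minimal sketch with illustrative default hyperparameters; bias correction and weight decay are omitted, as in the simplified rule above):

```python
import numpy as np

def adam_step(x, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the simplified Adam rule in (1): update the momentum
    and second moment, then apply an elementwise adaptively-scaled step."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    x = x - lr * m / (np.sqrt(v) + eps)
    return x, m, v
```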
To further improve training efficiency at large scale, being able to support large minibatches while keeping the convergence speed is a critical factor. Recently, You et al. (2020) found that it is difficult to keep Adam's convergence speed at batch sizes 16K or larger for BERT pretraining. To this end they proposed LAMB, which can be viewed as Adam with adaptive layerwise learning rates. By using LAMB, they are able to scale the batch size of BERT pretraining to 64K without losing accuracy, thereby reducing the BERT training time from 3 days to around 76 minutes. The major idea of LAMB is that it utilizes a layerwise scaling coefficient to regulate the update of each layer, and the update rule can be summarized as:

m_t^{(i)} = β₁ m_{t−1}^{(i)} + (1 − β₁) g_t^{(i)},
v_t^{(i)} = β₂ v_{t−1}^{(i)} + (1 − β₂) (g_t^{(i)})²,
u_t^{(i)} = m_t^{(i)} / (√v_t^{(i)} + η),
c_t^{(i)} = Clip(‖x_t^{(i)}‖ / ‖u_t^{(i)}‖, C_min, C_max),
x_{t+1}^{(i)} = x_t^{(i)} − γ c_t^{(i)} u_t^{(i)}.   (2)

Here m_t^{(i)} denotes the momentum of the i-th layer, and v_t^{(i)}, g_t^{(i)}, x_t^{(i)} follow the same definition; Clip(·) is the clipping operation (in the original LAMB paper the clip function is only applied to the numerator norm; the paper does not mention the exact clipping configuration, and our experiments show that the ratio varies a lot among different layers, so we apply the clipping function to the whole ratio, which is more stable among different layers and achieves similar SQuAD finetuning accuracy compared to the original LAMB work (You et al., 2020)); c_t^{(i)} is a layerwise scaling factor that regulates the update of each layer into a certain range. One thing to note is that within each layer, each tensor (e.g., weight and bias) will have its own scaling coefficient. The underlying intuition of LAMB's scaling coefficient is that when the update is relatively large compared to the parameter, we should apply a lower learning rate to that layer (and vice versa).
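LAMB's layerwise scaling in (2) can be sketched for a single layer as follows (a hedged sketch: the clipping bounds and epsilon values here are illustrative defaults rather than the paper's settings, and weight decay is omitted):

```python
import numpy as np

def lamb_layer_step(x, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-6,
                    clip_lo=0.1, clip_hi=1.0):
    """One LAMB step for a single layer: compute an Adam-style update
    direction, then rescale it by the clipped ratio of the parameter
    norm to the update norm (the layerwise scaling coefficient in (2))."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    update = m / (np.sqrt(v) + eps)
    ratio = np.linalg.norm(x) / max(np.linalg.norm(update), 1e-12)
    coeff = np.clip(ratio, clip_lo, clip_hi)  # layerwise scaling coefficient
    x = x - lr * coeff * update
    return x, m, v, coeff
```

A layer whose update is large relative to its parameters gets a small coefficient (effectively a lower learning rate), and vice versa.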
Motivated by the 1bit Adam and LAMB works, in the next section we investigate whether communication compression can benefit large-batch training with LAMB, and whether the insights behind 1bit Adam can be applied to LAMB to achieve “1bit LAMB”: communication-efficient large-scale large-batch training with LAMB's convergence speed.
3 Motivation and Insights
3.1 1bit Adam is not sufficient for large-batch distributed training
The 1bit Adam work claims to achieve the same convergence speed as Adam (based on BERT pretraining experiments with 4K batch size) (Tang et al., 2021). On the other hand, the LAMB work shows that it is difficult to keep Adam's convergence speed at batch sizes 16K or larger for BERT pretraining (You et al., 2020). To find out whether 1bit Adam is sufficient for large-batch distributed training, we perform a similar experiment using the BERT pretraining task at batch size 16K. Using You et al. (2020)'s training parameters (and tuning procedure) for LAMB and Adam, we perform BERT pretraining with LAMB and 1bit Adam, respectively. (In Section 3.1, for both LAMB and 1bit Adam, we use batch size = 16K, 28125/3125 steps for seqlen 128/512, weight decay = 0.01, and linear LR warmup and decay. For LAMB, we use 10% LR warmup and set the clipping bounds in (2) to 0.1 and 1. For 1bit Adam, we follow the LAMB paper's learning rates and LR warmup. All of these training parameters (except the LAMB clipping bounds) are from the LAMB paper.) For 1bit Adam, following the original work's strategy we set the number of warmup steps as 4000 (out of total 28125 steps) for seqlen 128 and 475 (out of 3125) for seqlen 512. Then we use the two pretrained BERT models to finetune SQuAD 1.1 (detailed training parameters in Section 6.1), and Table 1 summarizes the results. Results show that, similar to what You et al. (2020) observe for LAMB and Adam, 1bit Adam has slower convergence speed compared to LAMB at larger batch sizes.
Optimizer for BERT pretraining  BERT pretraining validation loss  SQuAD Avg. F1  SQuAD Max F1
LAMB (You et al., 2020)  –  91.345  –
Adam (You et al., 2020)  –  88.540  –
LAMB (ours)  1.362  90.716  91.119
1bit Adam (ours)  1.504  89.225  89.494
3.2 Communication is still an overhead for large-batch distributed training
In the 1bit Adam work, Tang et al. (2021) demonstrated that when pretraining BERT with vanilla Adam, the allreduce communication may take up to 94% and 75% of total training time per step on two clusters with Ethernet and InfiniBand networks. Since LAMB supports larger batch sizes for BERT pretraining and larger batch sizes lead to fewer communications, we want to investigate whether the communication overhead still exists. Thus we conduct a similar performance profiling experiment using the BERTLarge pretraining task (sequence length 128; detailed training parameters can be found in Section 6.1) but with the LAMB optimizer. We evaluate two kinds of clusters: one with 4 NVIDIA Tesla V100 GPUs per node (16GB GPU memory) and 40 Gigabit Ethernet inter-node network (4.1 Gbps effective bandwidth based on the iperf benchmark); the other with 8 V100 GPUs per node (32GB GPU memory) and 100 Gigabit InfiniBand EDR inter-node network (close to theoretical peak effective bandwidth based on microbenchmarks).
As presented in Table 2, the profiling results show that even with larger batch sizes the allreduce communication still contributes a great portion of the training time per step, up to 91% and 52% on the two kinds of clusters. This overhead grows when the number of nodes is larger, the batch size is smaller, or the network bandwidth is lower. These results indicate the opportunity to improve large-batch distributed training efficiency via communication compression.
Cluster network type  Num. node  Num. GPU  Batch size per GPU  Batch size  Grad accum. step  Forward (ms)  Backward allreduce (ms)  Backward everything else (ms)  Step (ms)  allreduce%
Ethernet  64  256  16  8K  2  55  3579  117  191  91% 
Ethernet  64  256  16  16K  4  111  3533  227  195  87% 
Ethernet  64  256  16  32K  8  224  3599  462  233  80% 
Ethernet  64  256  16  64K  16  445  3674  919  215  70% 
Ethernet  32  128  16  8K  4  112  3759  233  121  89% 
Ethernet  16  64  16  8K  8  223  3433  464  109  81% 
Ethernet  8  32  16  8K  16  445  3528  923  38  72% 
Ethernet  4  16  16  8K  32  881  3436  1827  33  56% 
Ethernet  2  8  16  8K  64  1773  2087  3696  31  28% 
Ethernet  1  4  16  8K  128  3532  234  7329  30  2% 
InfiniBand  16  128  64  8K  1  96  335  179  36  52% 
InfiniBand  16  128  64  16K  2  192  346  356  37  37% 
InfiniBand  16  128  64  32K  4  381  377  714  37  25% 
InfiniBand  16  128  64  64K  8  770  422  1422  32  16% 
InfiniBand  8  64  64  8K  2  192  332  352  34  36% 
InfiniBand  4  32  64  8K  4  384  339  711  31  23% 
InfiniBand  2  16  64  8K  8  768  270  1436  31  11% 
InfiniBand  1  8  64  8K  16  1534  167  2869  31  4% 
3.3 Investigating BERT pretraining with baseline LAMB: scaling coefficients become stable during training, but the variance is less stable
As discussed in the 1bit Adam work (Tang et al., 2021), the variance term in (1) makes Adam incompatible with error compensation because it is nonlinearly dependent on the gradient; therefore directly applying communication compression to Adam would greatly degrade the convergence speed. Since LAMB inherits Adam's optimization algorithm, it also has the variance term as a nonlinear gradient dependency. In addition, the calculation of LAMB's scaling coefficient in (2) depends on the variance. Thus we also cannot directly apply communication compression to LAMB. Another key finding in the 1bit Adam work is that Adam's variance term becomes stable at an early stage of training (after around 15% of total training for BERT), which is why 1bit Adam can “freeze” the variance after the warmup stage and use it as a fixed preconditioner during 1bit compression. In this section, we use the BERT pretraining task to investigate how LAMB's scaling coefficient and variance change during training.
Figure 1 presents LAMB's scaling coefficients for different layers during BERT pretraining at sequence length 128 (sequence length 512 has similar patterns). Results demonstrate that the scaling coefficients generally keep increasing until reaching the upper bound or a plateau, which is expected since with more learning the update tends to become smaller compared to the parameter. In addition, most of the scaling coefficients become stable at an early stage of training. Only the cls.seq_relationship.bias has a very unstable scaling coefficient. This is because this bias only has two elements, representing the two states of whether the two sequences are next to each other. Results in this figure also demonstrate that LAMB provides adaptive learning rates in two ways: 1) different layers may reach scaling coefficient plateaus at different times; 2) different layers may reach different scaling coefficient plateaus. We believe that this is one of the reasons why LAMB can provide better convergence (or reach the same convergence with less hyperparameter tuning) at larger batch sizes compared with Adam.
Figure 2 presents LAMB's variance norms for different layers during BERT pretraining at sequence length 128 (sequence length 512 has similar patterns). Results demonstrate that LAMB's variance terms are less stable compared with Adam's. In the 1bit Adam work, Tang et al. (2021) demonstrate that Adam's variance norm stays at the same magnitude after around 15% of total training. On the other hand, Figure 2 shows that for LAMB many layers have their variance norms constantly decreasing during the whole training task, by up to two orders of magnitude. We believe that there are two reasons: 1) LAMB has larger batch sizes, a smaller number of steps, and layerwise adaptive learning rates compared to Adam. 2) Because LAMB requires calculating a scaling coefficient for each layer, we cannot fuse all the variances together as in Adam, and each separate variance can have a less stable norm compared to a single fused variance.
3.4 Existing compression method affects LAMB’s convergence
Based on our study of baseline LAMB, we want to find out whether it is possible to directly apply 1bit Adam's two-stage algorithm in order to combine 1bit compression's communication efficiency and LAMB's convergence speed at large batch sizes. In this section we design and evaluate an experimental two-stage algorithm, “LAMB + basic 1bit”: 1) At the warmup stage, vanilla LAMB is used and we keep track of a moving average of LAMB's scaling coefficients. 2) At the compression stage, we stop updating the variance and LAMB's scaling coefficients and use them as a precondition, and communicate based on 1bit compressed momentum. (We also tried communicating based on 1bit compressed gradients, but it leads to much slower convergence. We believe this is because gradients are less stable, which can lead to higher compression error.) For simplicity we only apply this experimental algorithm to the BERT pretraining seqlen 128 phase, and still use vanilla LAMB in the seqlen 512 phase.
Because LAMB's scaling coefficient is essentially part of the learning rate (and it is just a scalar), updating this coefficient would not affect communication compression's error compensation mechanism as long as the change between two steps is small enough. (In fact, as described in Section 4, the proposed 1bit LAMB algorithm does update LAMB's scaling coefficients during the compression stage, but in a way different from the original LAMB algorithm.) However, we find it challenging to update this coefficient during the compression stage: in LAMB's algorithm, updating the scaling coefficient requires both the momentum and the variance term, while the error compensation mechanism requires freezing the variance term at the beginning of the compression stage due to its nonlinear dependency on the gradient. In addition, we find that due to the error compensation mechanism, the norm of some layers' momentum can become larger/smaller compared to the uncompressed case. As a result, updating LAMB's scaling coefficients during the compression stage based on the original LAMB algorithm produces suboptimal results and slower convergence. Thus for this section's experimental algorithm, we stop updating LAMB's scaling coefficients during the compression stage, and simply use the scaling coefficient moving averages calculated at the end of the warmup stage.
Table 3 presents the validation loss at the end of BERT pretraining when using vanilla LAMB, the experimental algorithm described in this section, and the proposed 1bit LAMB algorithm that we will describe in Section 4. We also use the pretrained BERT models to finetune SQuAD 1.1 and present the F1 scores. Results show that simply freezing both the variance and LAMB's scaling coefficients affects the convergence speed and leads to lower SQuAD scores (and lower GLUE scores in Section 6.2). Even if we increase the number of warmup steps, the convergence speed is still suboptimal. This is because LAMB's variance term is less stable, as demonstrated in our study. On the other hand, with the same warmup steps, the proposed 1bit LAMB algorithm provides the same convergence speed as vanilla LAMB, as shown in Table 3. In the upcoming section we describe how 1bit LAMB compensates for the effect of freezing the unstable variance by adaptively updating LAMB's scaling coefficients in a novel way during the compression stage.
Optimizer for BERT pretraining  BERT pretraining validation loss  SQuAD Avg. F1  SQuAD Max F1
LAMB (You et al., 2020)  –  90.584  –
LAMB (ours)  1.451  90.265  90.555
LAMB + basic 1bit  1.494  90.069  90.409
1bit LAMB  1.443  90.524  90.788
4 1bit LAMB Algorithm
The proposed 1bit LAMB optimizer introduces a novel way to update the adaptive layerwise learning rates during the compression stage. There are two major differences between 1bit LAMB and the original LAMB:

During the compression stage, 1bit LAMB updates the layerwise learning rates based on a novel “reconstructed gradient” derived from the compressed momentum. This makes 1bit LAMB compatible with error compensation and able to track the training dynamics under compression.

1bit LAMB also introduces extra stabilizing soft thresholds when updating the layerwise learning rates at the compression stage, which makes training more stable under compression.
We summarize 1bit LAMB in Algorithm 1. The following subsections describe the contributions in terms of algorithm and implementation.
Problem setting
In this paper, we focus on the following optimization task and rely on the following notions and definitions:

min_{x ∈ R^d} f(x) = (1/n) Σ_{i=1}^{n} f_i(x),  where f_i(x) := E_{ζ ∼ D_i} F(x; ζ),   (3)

where d is the dimension of the model x, D_i is the data distribution of individual data samples on the i-th worker, and F(x; ζ) is the loss function.
Notations and definitions
Throughout this paper, we use the following notations:

∇f(·) denotes the gradient of a function f.

f* := min_x f(x) denotes the optimal value of (3).

C_ω(·) denotes the randomized compressing operator, where ω denotes the random variable. One example is the randomized quantization operator that rounds each element x up to ⌈x⌉ with probability x − ⌊x⌋ and down to ⌊x⌋ with probability 1 − (x − ⌊x⌋).

√x denotes the square root of x. In this paper, if the argument is a vector, it returns the vector of elementwise square roots.

x² denotes the elementwise square operation if x is a vector.

x/y or x ÷ y denotes the elementwise division operation if both x and y are vectors and their dimensions match.
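As a concrete instance of the randomized compressing operator, here is a stochastic-rounding sketch that is unbiased in expectation (the operator's exact definition is partly elided in this copy of the text, so this example is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    """Randomized (unbiased) quantization: round each entry up with
    probability equal to its fractional part, down otherwise, so that
    the compressed value equals x in expectation."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)
```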
4.1 Adaptively updating LAMB scaling coefficients during the compression stage based on a reconstructed “fresh” variance
During the compression stage, we have to freeze the variance in order to apply the error compensation mechanism correctly (Tang et al., 2021). However, this brings two challenges when combining LAMB with communication compression, as described in Section 3: 1) We cannot update LAMB's scaling coefficients in (2) during the compression stage based on the LAMB algorithm, because doing so requires the uncompressed momentum and an up-to-date variance; 2) In vanilla LAMB the scaling coefficients become stable during training. However, because we have to freeze the variance term and LAMB's variance term is not stable for some layers, we have to adjust the scaling coefficients to compensate for these changes in the variance. Otherwise, as shown in Section 3.4, freezing both the variance and LAMB's scaling coefficients leads to slower convergence.
To this end, 1bit LAMB uses a novel way to adaptively update the LAMB scaling coefficients during the compression stage to compensate for the difference between the frozen variance and the actual variance. During the warmup stage, vanilla LAMB is used and we also keep track of a moving average of each layer's scaling coefficient, which is used during the compression stage. We use this moving average instead of the exact coefficient at the end of warmup because, as shown in Section 3.3, the scaling coefficient is not stable at the early stage of training. At the end of warmup (Algorithm 1 line 3), we stop updating the scaling coefficient moving average, and store the frozen variance to be used when updating the model during the compression stage (line 21). On the other hand, we still keep updating another “fresh” variance by reconstructing the global gradient based on this step's and last step's momentum (lines 15, 16). (One may ask why not communicate based on the compressed gradient so that the fresh variance can be computed directly. However, compressing gradients leads to higher compression error and slower convergence, as we found in Section 3.4.)
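The reconstruction idea can be sketched as follows (a sketch under our reading of Algorithm 1 lines 15–16: since the momentum recurrence is m_t = β₁m_{t−1} + (1−β₁)g_t, a gradient estimate can be recovered from two consecutive momenta; function and variable names are ours, not Algorithm 1's):

```python
import numpy as np

def update_fresh_variance(m_t, m_prev, v_fresh, beta1=0.9, beta2=0.999):
    """Sketch of the 'fresh' variance update during the compression stage:
    invert the momentum recurrence to reconstruct a global gradient
    estimate, then feed it into the usual second-moment update."""
    g_reconstructed = (m_t - beta1 * m_prev) / (1 - beta1)
    v_fresh = beta2 * v_fresh + (1 - beta2) * g_reconstructed**2
    return v_fresh, g_reconstructed
```

Because only the (already communicated) momentum is needed, this keeps the fresh variance up to date without communicating gradients.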
To update the LAMB scaling coefficients during the compression stage, we compute a 1bit LAMB scaling ratio, which is the max element of the elementwise (frozen variance / fresh variance) ratio (line 17). We use the max element as the scaling ratio because it is effective and cheap to compute based on our experiments. To avoid extreme ratios and dramatic changes between consecutive ratios, we use two kinds of clipping configurable by the user (lines 18, 19). On the other hand, the clipping bounds from the LAMB algorithm in (2) are not used during the compression stage. Then we compute this step's LAMB scaling coefficient using the 1bit LAMB scaling ratio and the moving average at the end of warmup, and use it to update the model (lines 20, 21).
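The scaling ratio computation with the two kinds of clipping can be sketched as follows (the threshold values below are illustrative placeholders, not the paper's configuration):

```python
import numpy as np

def scaling_ratio(frozen_var, fresh_var, prev_ratio,
                  r_min=0.5, r_max=2.0, max_step=0.1, eps=1e-12):
    """Sketch of the 1bit LAMB scaling ratio: the max element of the
    elementwise frozen/fresh variance ratio, clipped both to an absolute
    range and to a maximum change relative to the previous step."""
    ratio = np.max(frozen_var / (fresh_var + eps))
    ratio = np.clip(ratio, r_min, r_max)  # avoid extreme ratios
    ratio = np.clip(ratio, prev_ratio - max_step,
                    prev_ratio + max_step)  # avoid dramatic changes
    return ratio
```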
Figure 3 presents the 1bit LAMB scaling ratios for different layers during BERT pretraining at sequence length 128 (sequence length 512 has similar patterns). When comparing with Figure 2, we find that for those layers with less stable variance (e.g., the 3 kinds of embeddings, weights in BertLayer), the corresponding 1bit LAMB scaling ratios are also larger. As a result, 1bit LAMB adaptively updates the LAMB scaling coefficients during compression according to the difference between the frozen and fresh variance.
4.2 Reducing the number of communications by momentum fusion
Different from Adam, LAMB has scaling coefficients that need to be updated separately for each layer. For the communication of the compressed momentum during 1bit LAMB's compression stage, if we also communicate separately for each layer, the number of communications (302 for the BERTLarge model) will greatly affect the overall performance. Thus we fuse all layers' momentum into a contiguous 1D buffer and do just one communication over this fused momentum. This fused momentum is not a duplication, but a different “view” of all layers' momentum. We implement this momentum fusion via torch._utils._flatten_dense_tensors and torch._utils._unflatten_dense_tensors.
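The fusion itself is a flatten/unflatten round trip; the paper uses the torch._utils helpers, but the idea can be shown framework-agnostically (a NumPy sketch with our own function names):

```python
import numpy as np

def flatten(tensors):
    """Fuse per-layer momentum tensors into one contiguous 1-D buffer,
    so a single collective call replaces one call per layer."""
    return np.concatenate([t.ravel() for t in tensors])

def unflatten(flat, tensors):
    """Recover per-layer arrays with the original shapes from the fused buffer."""
    out, offset = [], 0
    for t in tensors:
        out.append(flat[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return out
```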
4.3 Reducing compression error by momentum scaling
We find that after momentum fusion, the compression error increases a lot for some layers, which greatly increases the chance of divergence. This is because: 1) When communicating based on the fused momentum, all layers' momentum will be compressed to the same scale due to the nature of 1bit compression (representing the tensor by signs and a single scale). Thus layers with very small/large momentum scales will have larger compression error; 2) Due to LAMB's layerwise adaptive learning rates, each layer learns at a different speed, which can further increase the variance of momentum scales among layers. We solve this issue by computing an average momentum scale at the end of the warmup stage and using it to compute a momentum scale coefficient for each layer. During the compression stage, we multiply each layer's local momentum by its scale coefficient, then do the compressed communication, then divide each layer's global momentum by the same scale coefficient. By performing this momentum scaling, all layers' momentum will have similar scales when passed to the compressed communication. Thus we are able to greatly reduce the compression error and the chance of divergence. Since the momentum scale coefficients are fixed during the compression stage, this does not affect 1bit compression's error compensation mechanism.
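The momentum scaling procedure above can be sketched as follows (an illustrative sketch: we use the mean magnitude as each layer's “scale”, which is one reasonable choice but may differ from the actual implementation):

```python
import numpy as np

def momentum_scale_coeffs(momenta, eps=1e-12):
    """Compute each layer's momentum scale (mean magnitude), average the
    scales across layers, and derive a per-layer coefficient that maps
    every layer onto that common scale."""
    scales = [np.mean(np.abs(m)) + eps for m in momenta]
    avg = np.mean(scales)
    return [avg / s for s in scales]

def scaled_compress_decompress(momenta, coeffs, compress):
    """Multiply each layer's momentum by its coefficient before the
    compressed communication and divide afterwards; the coefficients
    stay fixed, so error compensation is unaffected."""
    return [compress(m * c) / c for m, c in zip(momenta, coeffs)]
```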
5 Proposed System Design
To realize the compressed communication at the system level, in the 1bit Adam work Tang et al. (2021) designed a custom collective primitive called “compressed allreduce” using the Message Passing Interface (MPI), because the NCCL library (before v2.7) did not expose an Alltoall primitive or any point-to-point (send/recv) communication primitives that could be used to implement one. In addition to implementing 1bit LAMB using an MPI-based backend similar to 1bit Adam's, we introduce a new system implementation for compressed communication (applicable to both 1bit LAMB and 1bit Adam) using the NCCL backend of PyTorch distributed. This implementation follows the 3-phase design proposed in the 1bit Adam work: 1) The gather step, where each worker sends its i-th chunk to worker i. We implement this step using NCCL's Alltoall (introduced in v2.7) and AllGather. 2) The average step, where each worker averages all chunks it receives. 3) The scatter step, where each worker receives the averaged i-th chunk from worker i. We implement this step using NCCL's AllGather.
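The 3-phase design can be illustrated with a single-process simulation (a sketch only: real workers run concurrently and exchange chunks via NCCL collectives, and error feedback is omitted here for brevity):

```python
import numpy as np

def onebit(x):
    """1-bit compression: signs plus a single scale (mean magnitude)."""
    return np.mean(np.abs(x)) * np.sign(x)

def compressed_allreduce(worker_tensors):
    """Simulate the 3-phase compressed allreduce:
    1) gather: worker i collects the i-th chunk from every worker
       (an Alltoall in the real NCCL implementation);
    2) average: each worker averages and compresses its chunk;
    3) scatter: an AllGather shares every averaged chunk with everyone."""
    n = len(worker_tensors)
    chunks = [np.array_split(t, n) for t in worker_tensors]        # phase 1
    averaged = [onebit(np.mean([chunks[w][i] for w in range(n)], axis=0))
                for i in range(n)]                                 # phase 2
    return np.concatenate(averaged)                                # phase 3
```

The chunked design keeps each worker's communication volume at roughly 1/n of the tensor per peer, which is what makes the compressed collective scale.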
Compared to the MPI-based implementation, this new NCCL-based implementation significantly improves usability since NCCL is integrated with PyTorch distributed. In addition, our evaluations show that the performance of the NCCL-based implementation is better than the MPI-based implementation on Ethernet-based systems and on par on InfiniBand-based systems. Thus in the evaluation section we focus on presenting results with the NCCL-based backend, but we also include a performance comparison between the MPI- and NCCL-based implementations.
6 Evaluation
6.1 Methodology
6.1.1 Dataset and models
We evaluate the convergence and performance of 1-bit LAMB and uncompressed LAMB on the BERT-Large (24 layers, 1024 hidden size, 16 attention heads, 340M parameters) pre-training task. We use the same dataset as Devlin et al. (2019), a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words, respectively. Compared to the original BERT model, one notable change is that we apply pre-LN instead of post-LN for better training stability (Zhang and He, 2020; Xiong et al., 2020). We use the GLUE fine-tuning benchmark (Wang et al., 2018) and the SQuAD 1.1 fine-tuning task (https://rajpurkar.github.io/SQuAD-explorer/) to evaluate the convergence of the BERT models trained by LAMB and 1-bit LAMB.
6.1.2 Hardware
We use the two clusters described in Section 3.2. We use 8 to 256 GPUs for BERT pre-training tasks to measure 1-bit LAMB's performance gain, and 4 GPUs per run for fine-tuning tasks. Note that because 1-bit LAMB introduces additional memory overhead (one persistent copy of the fresh variance and one temporary copy of the last step's momentum), and because the V100 GPUs on the Ethernet cluster have 16GB of memory instead of 32GB, we were not able to fit large batch sizes for BERT pre-training when the number of GPUs is small: for seqlen 128 and 512 we need at least 8 and 16 GPUs on the Ethernet cluster, respectively. On the other hand, since the benefit of compression depends on the fraction of time spent in communication, 1-bit LAMB provides more performance gain as the number of GPUs increases, as shown in the evaluation results.
6.1.3 Training parameters
For BERT pre-training, we set the parameters in (2) as , , and for LAMB and 1-bit LAMB. For 1-bit LAMB, we set , , , and in Algorithm 1. For convergence analysis, we set the total batch size as for seqlen 128 and for seqlen 512. For performance analysis, we test different batch sizes from to .
For BERT pre-training seqlen 128, the learning rate starts from , exponentially increases to as a warmup in the first 450 steps, then is multiplied by 0.9 every 250 steps. The total number of steps is 5993. For 1-bit LAMB we use the first 1000 steps (16.7%) as the warmup stage. For BERT pre-training seqlen 512, the learning rate starts from , exponentially increases to as a warmup in the first 150 steps, then is multiplied by 0.9 every 150 steps. The total number of steps is 555. For 1-bit LAMB we use the first 107 steps (19.3%) as the warmup stage.
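The shape of this schedule can be sketched as follows. The actual base and peak learning rates are not restated here, so `base_lr` and `peak_lr` below are placeholders, and we read "decays into 0.9 of the original" as multiplying the rate by 0.9 at each decay interval.

```python
def lr_at_step(step, base_lr, peak_lr, warmup_steps=450, decay_every=250,
               decay_rate=0.9):
    """Schedule used for BERT pre-training seqlen 128: exponential warmup
    from base_lr to peak_lr over warmup_steps, then multiply the rate by
    decay_rate every decay_every steps."""
    if step < warmup_steps:
        # Exponential warmup: geometric interpolation base_lr -> peak_lr.
        return base_lr * (peak_lr / base_lr) ** (step / warmup_steps)
    decays = (step - warmup_steps) // decay_every
    return peak_lr * decay_rate ** decays

# Placeholder base/peak values; 5993 total steps as reported for seqlen 128.
schedule = [lr_at_step(s, base_lr=1e-7, peak_lr=1e-2) for s in range(5993)]
```

For seqlen 512 the same shape applies with 150 warmup steps, a 150-step decay interval, and 555 total steps.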
For GLUE benchmarks we use the Adam optimizer and perform single-task training on the dev set. Following the setup in the BERT paper (Devlin et al., 2019), we use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks. For each task, we select the best learning rate among a small set of candidates. For SQuAD fine-tuning we use the Adam optimizer and the same parameters as published by HuggingFace (batch size = , learning rate = , dropout = , 2 epochs).

6.2 Convergence analysis
Figure 4 presents the sample-wise convergence results for BERT pre-training. For both seqlen 128 and 512, 1-bit LAMB provides the same convergence speed as LAMB, while taking much less time per batch due to communication compression. The BERT pre-training validation loss and SQuAD fine-tuning results were presented in Section 3.4, Table 3, and Table 4 presents the GLUE results. For both SQuAD and GLUE, the results show that 1-bit LAMB matches LAMB's fine-tuning task accuracy, while simply freezing both the variance and LAMB's scaling coefficients hurts accuracy.
Table 4: GLUE dev set results.

|                    | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|--------------------|-------------|------|------|-------|------|-------|------|------|---------|
| Original           | 86.7/85.9   | 89.3 | 92.7 | 94.9  | 60.5 | 86.5  | 85.4 | 70.1 | 83.6    |
| LAMB               | 85.4/85.5   | 91.4 | 91.9 | 92.8  | 60.9 | 90.1  | 86.9 | 70.4 | 83.9    |
| LAMB + basic 1-bit | 84.8/84.7   | 91.2 | 91.4 | 92.4  | 54.0 | 89.6  | 84.2 | 66.6 | 82.1    |
| 1-bit LAMB         | 85.5/85.6   | 91.3 | 92.3 | 93.1  | 59.2 | 90.0  | 86.5 | 71.5 | 83.9    |
6.3 Performance analysis
Computed as 1/(warmup_ratio + (1 - warmup_ratio)/16) for FP16 training, 1-bit LAMB offers 4.6x and 4.1x end-to-end communication volume reduction for BERT-Large seqlen 128 and 512, respectively. To measure the actual end-to-end performance gain, we first perform a throughput analysis where we run the warmup stage (i.e., baseline LAMB's performance) and the compression stage of 1-bit LAMB for 200 steps each, and measure the average throughput of the two stages. Figure 5 presents the results with the NCCL-based compressed communication backend under different batch sizes and numbers of GPUs. For seqlen 128, 1-bit LAMB provides up to speedup during the compression stage, which is equivalent to end-to-end speedup (computed as 1/(warmup_ratio + (1 - warmup_ratio)/compression_stage_speedup)). For seqlen 512, 1-bit LAMB provides up to speedup during the compression stage, which is equivalent to end-to-end speedup. This demonstrates 1-bit LAMB's better scalability compared to LAMB. It is also worth mentioning that 1-bit LAMB on Ethernet (4.1 Gbps effective bandwidth, 4 GPUs per node) achieves throughput comparable to LAMB on InfiniBand (near 100 Gbps effective bandwidth, 8 GPUs per node), which demonstrates 1-bit LAMB's efficiency considering the hardware differences.
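The formulas above can be checked directly by plugging in the warmup ratios reported earlier (16.7% for seqlen 128, 19.3% for seqlen 512) and the 16x compression-stage volume reduction for FP16:

```python
def end_to_end_reduction(warmup_ratio, stage_compression=16.0):
    """End-to-end communication volume reduction:
    1 / (warmup_ratio + (1 - warmup_ratio) / stage_compression)."""
    return 1.0 / (warmup_ratio + (1.0 - warmup_ratio) / stage_compression)

def end_to_end_speedup(warmup_ratio, stage_speedup):
    """Same formula applied to the measured compression-stage speedup."""
    return 1.0 / (warmup_ratio + (1.0 - warmup_ratio) / stage_speedup)

print(round(end_to_end_reduction(0.167), 1))  # seqlen 128 -> 4.6
print(round(end_to_end_reduction(0.193), 1))  # seqlen 512 -> 4.1
```

The uncompressed warmup stage bounds the end-to-end gain: even with infinite compression-stage speedup, the end-to-end speedup cannot exceed 1/warmup_ratio.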
In addition to the throughput analysis, we also measure the total pre-training runtime at batch size for both 1-bit LAMB and LAMB. As shown in Table 5, overall 1-bit LAMB provides 2.2x and 1.5x speedup for seqlen 128 and 512, respectively. These numbers are consistent with the end-to-end speedups calculated in the throughput analysis based on the results in Figure 5(d) and 5(g). For seqlen 128, the end-to-end speedup based on runtime is slightly higher than the speedup based on throughput. We find this is because uncompressed LAMB's larger communication volume makes it more sensitive to occasional fluctuations of the actual network bandwidth.
All of the results above are evaluated with the NCCL-based compressed communication backend. Figure 6 presents the performance comparison between the MPI- and NCCL-based implementations. Compared to the MPI-based implementation introduced in the 1-bit Adam work, our NCCL-based implementation provides better performance on the Ethernet cluster (where the OpenMPI library is used for the MPI backend) and on-par performance on the InfiniBand cluster (where MVAPICH2-GDR is used for the MPI backend).
Table 5: Total BERT pre-training runtime.

|            | Seqlen 128     | Seqlen 512    | Total          |
|------------|----------------|---------------|----------------|
| LAMB       | 657 min        | 74 min        | 731 min        |
| 1-bit LAMB | 301 min (2.2x) | 50 min (1.5x) | 351 min (2.1x) |
7 Conclusion
To reduce both the number and the volume of communications for large-scale training, we propose an error-compensated, LAMB-preconditioned momentum SGD algorithm, 1-bit LAMB, which combines the power of large batch optimization and communication compression by introducing a novel way to support adaptive layerwise learning rates during communication compression. We also introduce an easier-to-use and more efficient compressed communication backend based on NCCL. Evaluations show that 1-bit LAMB with the NCCL-based backend achieves up to 4.6x communication volume reduction and up to 2.2x end-to-end speedup for BERT pre-training compared to uncompressed LAMB, with the same convergence speed and fine-tuning task accuracy.
References
- cpSGD: communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems 31, pp. 7564–7575.
- QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems 30, pp. 1709–1720.
- Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems 32, pp. 14695–14706.
- signSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems 28, pp. 1531–1539.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(61), pp. 2121–2159.
- Linearly converging error compensated SGD. arXiv:2010.12292.
- Communication-efficient distributed SGD with sketching. In Advances in Neural Information Processing Systems 32, pp. 13144–13154.
- A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems 31, pp. 2530–2541.
- Adam: a method for stochastic optimization. arXiv:1412.6980.
- Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations.
- PipeSGD: a decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems 31, pp. 8056–8067.
- Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, pp. 5330–5340.
- Distributed SGD with flexible gradient compression. IEEE Access 8, pp. 64707–64717.
- 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014.
- Towards more efficient stochastic decentralized learning: faster convergence and sparse communication. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 4624–4633.
- A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 2238–2247.
- Towards scalable distributed training of deep learning on public cloud clusters. In Proceedings of Machine Learning and Systems.
- Compressing gradient optimizers via Count-Sketches. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 5946–5955.
- Sparsified SGD with memory. In Advances in Neural Information Processing Systems 31, pp. 4447–4458.
- Communication-efficient distributed learning via lazily aggregated quantized gradients. In Advances in Neural Information Processing Systems 32, pp. 3370–3380.
- 1-bit Adam: communication efficient large-scale training with Adam's convergence speed. arXiv:2102.02888.
- RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- PowerSGD: practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems 32, pp. 14259–14268.
- PowerGossip: practical low-rank communication compression in decentralized deep learning. arXiv:2008.01425.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pp. 353–355.
- Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems 31, pp. 1299–1309.
- TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems 30, pp. 1509–1519.
- CSER: communication-efficient SGD with error reset. arXiv:2007.13221.
- On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, PMLR 119, pp. 10524–10533.
- Communication-computation efficient gradient coding. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 5610–5619.
- Large batch optimization for deep learning: training BERT in 76 minutes. In International Conference on Learning Representations.
- Double quantization for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems 32, pp. 4438–4449.
- ADADELTA: an adaptive learning rate method. arXiv:1212.5701.
- ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp. 4035–4043.
- Accelerating training of transformer-based language models with progressive layer dropping. In Advances in Neural Information Processing Systems 33, pp. 14011–14023.
- Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems 32, pp. 11450–11460.
- Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. arXiv:1609.08326.