Large-scale training of deep learning models with up to hundreds of billions of parameters is computation-heavy and expensive (Brown et al., 2020). In addition, communication overhead is another serious system challenge. A recent study of BERT pre-training with Adam demonstrates that the allreduce communication can take up to 94% and 75% of total training time per step on clusters with Ethernet and InfiniBand inter-node connections, respectively (Tang et al., 2021).
To achieve communication efficient distributed training, there are two promising directions: large batch optimization and communication compression. The LAMB optimizer, which can be viewed as Adam with adaptive layerwise learning rates, is an example of large batch optimization (You et al., 2020). LAMB can scale the batch size of BERT pre-training to 64K without losing accuracy, thereby greatly reducing the total training time since larger batches require fewer communication rounds. On the other hand, recent works on communication compression such as 1-bit Adam demonstrate that it is possible to combine 1-bit compression with Adam's convergence speed, thereby substantially reducing BERT pre-training communication volume (Tang et al., 2021).
Both LAMB and 1-bit Adam demonstrate great benefit for distributed training. Unfortunately, our studies show that simply using one of them is not sufficient to fully address the communication issue, especially under limited network bandwidth and a large number of GPUs/machines. We find that communication is still a non-trivial overhead when running large-scale distributed training with LAMB, even with larger batch sizes. 1-bit Adam has the same convergence speed as Adam (Tang et al., 2021), and a previous study shows that Adam converges more slowly than LAMB at batch sizes 16K or larger for BERT pre-training (You et al., 2020). Using You et al. (2020)'s methodology, our BERT pre-training experiments show that 1-bit Adam, similar to Adam, also converges more slowly than LAMB at batch size 16K. Even with communication compression, this batch size limitation would hurt communication efficiency when the number of GPUs/machines is large.
LAMB and 1-bit Adam are two distinct optimizers. However, the techniques behind them are complementary: large batch optimization reduces the number of communication rounds, and compression reduces the communication volume. Motivated by this, we aim to combine LAMB's large batch optimization algorithm with the compression strategies behind 1-bit Adam. However, we find that they are not directly compatible due to LAMB's unique layerwise learning rate update strategy, which requires information that is missing when communication (and optimizer states) is compressed.
The studies and challenges above motivate us to design a new algorithm called 1-bit LAMB. Learning from the insights behind 1-bit Adam, 1-bit LAMB is a 2-stage algorithm which uses LAMB (warmup stage) to "pre-condition" a communication compressed momentum SGD algorithm (compression stage). During the compression stage, where the original LAMB algorithm cannot be used to update the layerwise learning rates, 1-bit LAMB employs a novel way to adaptively scale the layerwise learning rates based on information from both warmup and compression stages. As a result, 1-bit LAMB is able to achieve LAMB's large-batch convergence speed under compressed communication, which is impossible using existing approaches.
We implement the 1-bit LAMB algorithm using the existing MPI-based compressed communication backend (introduced in the 1-bit Adam work) and a new NCCL-based compressed communication backend (introduced in this paper) which provides better usability and performance. We evaluate 1-bit LAMB's convergence and performance on BERT pre-training and GLUE/SQuAD fine-tuning tasks. Results show that across different batch sizes and with up to 256 GPUs, 1-bit LAMB with the NCCL-based backend achieves substantial communication volume reduction and end-to-end speedup (in terms of number of training samples per second) for BERT pre-training compared to uncompressed LAMB, together with the same convergence speed (in terms of number of pre-training samples to reach the same accuracy on GLUE/SQuAD fine-tuning tasks).
We make the following contributions:
We conduct an extensive study of distributed training with LAMB and 1-bit Adam, which provides detailed insights behind the algorithm and motivates our work. (Section 3)
We propose a new algorithm, 1-bit LAMB, a communication efficient momentum SGD algorithm pre-conditioned with LAMB optimizer, which to the best of our knowledge is the first work that combines communication compression and large batch optimization. (Section 4)
We implement a custom collective communication primitive using NCCL backend of PyTorch distributed which provides better usability and performance than existing solutions. This backend can be applied to 1-bit LAMB, 1-bit Adam, and potentially other communication compression algorithms. (Section 5)
We conduct large-scale convergence and performance experiments on BERT pre-training and GLUE/SQuAD fine-tuning, which demonstrate 1-bit LAMB’s superior performance and competitive convergence speed compared to LAMB. (Section 6)
2 Related Work and Background
2.1 Communication efficient distributed training
To achieve communication efficient distributed training, techniques include decentralization (Lian et al., 2017; Koloskova* et al., 2020; Li et al., 2018), asynchronous communication (Zheng et al., 2016; Chaturapruek et al., 2015), and gradient compression/quantization, which we focus on in this paper. Before communication, we can compress the original gradient g into C(g), where C(·) is the compression operator (which could also include randomness). As a result, the communication volume can be greatly reduced. Compression can be achieved by quantization, sparsification, sketching, etc. (Ye and Abbe, 2018; Alistarh et al., 2017; Agarwal et al., 2018; Yu et al., 2019; Spring et al., 2019; Ivkin et al., 2019; Shi et al., 2021). Several works focus on unbiased compression methods (the original and compressed tensors have the same expectation), such as centralized compressed parallel SGD (Alistarh et al., 2017) and many others (Wangni et al., 2018; Shen et al., 2018; Zhang et al., 2017; Wen et al., 2017; Jiang and Agrawal, 2018). On the other hand, recent works on biased compression methods demonstrate better compression rates and the same convergence rate by using an error compensation technique (Seide et al., 2014; Bernstein et al., 2019; Stich et al., 2018; Zheng et al., 2019; Phuong and Phong, 2020; Yu et al., 2019; Shi et al., 2019; Ivkin et al., 2019; Sun et al., 2019; Basu et al., 2019; Vogels et al., 2019; Tang et al., 2021).
The idea of using error compensation for compression was proposed in the 1-bit SGD work (Seide et al., 2014): instead of compressing the gradient at each iteration directly, they compress the sum of the gradient and the last step's compression error. They find that with error compensation, training can achieve promising convergence speed even with 1-bit compression (representing the gradient by signs and a scale). Recent works provide theoretical guarantees for this method (Bernstein et al., 2019), and also demonstrate that it admits the same asymptotic convergence rate as the uncompressed case (Stich et al., 2018). In addition, the error compensation method enables almost any compression method (Stich et al., 2018), either biased or unbiased, to converge as fast as the uncompressed case. It is also compatible with other techniques including decentralized training (Vogels et al., 2020), local SGD (Xie et al., 2020), and accelerated algorithms (Gorbunov et al., 2020).
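As a minimal illustration, the error-compensated 1-bit scheme described above can be sketched in a few lines of plain Python (function names here are illustrative, not from any library):

```python
# Sketch of error-compensated 1-bit compression (Seide et al., 2014):
# compress the sum of the current gradient and last step's compression
# error, then carry the new residual error forward to the next step.

def compress_1bit(x):
    """Represent x by its signs plus a single scale (mean absolute value)."""
    scale = sum(abs(v) for v in x) / len(x)
    return [scale if v >= 0 else -scale for v in x]

def step_with_error_feedback(grad, error):
    """Compress (grad + previous error); return compressed tensor and new error."""
    corrected = [g + e for g, e in zip(grad, error)]
    compressed = compress_1bit(corrected)
    new_error = [c - q for c, q in zip(corrected, compressed)]
    return compressed, new_error
```

Because the residual is re-injected at the next step, information lost to the aggressive 1-bit quantization is not discarded, only delayed, which is the intuition behind the matching asymptotic convergence rate.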
2.2 Adam, 1-bit Adam, and LAMB
Influenced by previous works that study adaptive learning rate (Adagrad (Duchi et al., 2011)
, RMSprop(Tieleman and Hinton, 2011), Adadelta (Zeiler, 2012), etc.), Adam (Kingma and Ba, 2015)
can be viewed as SGD with momentum plus an adaptive learning rate scaling on each coordinate of the gradient. It has demonstrated promising convergence speed and hyperparameter robustness on many deep learning tasks. Adam's update rule can be summarized as (for simplicity weight decay is omitted):

m_t = β1·m_{t-1} + (1 - β1)·g_t
v_t = β2·v_{t-1} + (1 - β2)·(g_t)^2        (1)
x_{t+1} = x_t - γ·m_t / (√v_t + ε)

where g_t is the stochastic gradient at step t, m_t is the momentum, v_t is the variance, β1 and β2 are the decaying factors, x_t is the model, γ is the learning rate, and ε is an additive constant that ensures that we do not divide by 0. Here the square, square root, and division all denote element-wise operations.
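The update rule above maps directly to code. Below is a minimal element-wise sketch (bias correction and weight decay omitted, as in the text):

```python
import math

# Minimal element-wise Adam step following the update rule summarized
# above: momentum and variance are exponential moving averages of the
# gradient and squared gradient, and the step is scaled per coordinate.
def adam_step(x, m, v, grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    x = [xi - lr * mi / (math.sqrt(vi) + eps) for xi, mi, vi in zip(x, m, v)]
    return x, m, v
```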
Recently, Tang et al. (2021) proposed 1-bit Adam, which combines the efficiency of error-compensated 1-bit compression with Adam's convergence speed. Their study shows that error-compensated compression does not work for Adam directly, because Adam is non-linearly dependent on the gradient (the variance term in (1)), which breaks the error compensation mechanism. On the other hand, they find that Adam's variance becomes stable at an early stage of training. To this end, they design a new 2-stage algorithm, 1-bit Adam, which uses Adam (warmup stage) to "pre-condition" a communication compressed momentum SGD algorithm (compression stage). At the warmup stage, vanilla Adam is used. At the compression stage, they stop updating the variance and use it as a fixed precondition, and communicate the momentum with error-compensated 1-bit compression. Their experiments on up to 256 GPUs show that 1-bit Adam achieves the same convergence behaviour and final accuracy as Adam, together with significantly less communication volume and faster end-to-end throughput.
To further improve training efficiency at large scale, being able to support large minibatches while keeping the convergence speed is a critical factor. Recently, You et al. (2020) found that it is difficult to keep Adam's convergence speed at batch sizes 16K or larger for BERT pre-training. To this end they proposed LAMB, which can be viewed as Adam with adaptive layerwise learning rates. By using LAMB, they are able to scale the batch size of BERT pre-training to 64K without losing accuracy, thereby reducing the BERT training time from 3 days to around 76 minutes. The major idea of LAMB is that it utilizes a layerwise scaling coefficient to regulate the update of each layer, and the update rule can be summarized as:

m_t^(l) = β1·m_{t-1}^(l) + (1 - β1)·g_t^(l)
v_t^(l) = β2·v_{t-1}^(l) + (1 - β2)·(g_t^(l))^2
u_t^(l) = m_t^(l) / (√v_t^(l) + ε)                            (2)
r_t^(l) = Clip(‖x_t^(l)‖ / ‖u_t^(l)‖, C_min, C_max)
x_{t+1}^(l) = x_t^(l) - γ·r_t^(l)·u_t^(l)

Here m_t^(l) denotes the momentum of the l-th layer, and v_t^(l), β1, β2, γ, ε follow the same definitions as in (1); Clip(·, C_min, C_max) is the clipping operation. (In the original LAMB paper the clip function is only applied to the parameter norm. That paper does not specify the exact clipping configuration, and our experiments show that the ratio varies a lot among different layers. Thus we apply the clipping function to the whole ratio, which is more stable across layers. With this clipping function we are able to achieve similar SQuAD fine-tuning accuracy compared to the original LAMB work (You et al., 2020).) r_t^(l) is a layer-wise scaling factor that regulates the update of each layer into a certain range. One thing to note is that within each layer, each tensor (e.g., weight and bias) has its own scaling coefficient. The underlying intuition of LAMB's scaling coefficient is that when the update is relatively large compared to the parameter, we should apply a lower learning rate to that layer (and vice versa).
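The layerwise scaling coefficient described above can be sketched as follows. The clipping thresholds `c_min`/`c_max` default to the values this paper reports using (0.1 and 1, see Section 3.1); the zero-norm fallback is our own assumption for illustration:

```python
import math

# Sketch of LAMB's layer-wise scaling coefficient: the ratio of the
# parameter norm to the update norm, clipped into [c_min, c_max].
# A large update relative to the parameter yields a small coefficient,
# i.e. a lower effective learning rate for that layer.
def lamb_scaling_coeff(param, update, c_min=0.1, c_max=1.0):
    p_norm = math.sqrt(sum(p * p for p in param))
    u_norm = math.sqrt(sum(u * u for u in update))
    if u_norm == 0:
        return c_max  # degenerate case; fallback chosen for illustration
    return min(max(p_norm / u_norm, c_min), c_max)
```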
Motivated by the 1-bit Adam and LAMB works, in the next section we investigate whether communication compression could be beneficial to large-batch training with LAMB, and whether the insights behind 1-bit Adam can be applied to LAMB to achieve “1-bit LAMB”: communication efficient large-scale large-batch training with LAMB’s convergence speed.
3 Motivation and Insights
3.1 1-bit Adam is not sufficient for large-batch distributed training
The 1-bit Adam work claims to achieve the same convergence speed as Adam (based on BERT pre-training experiments with 4K batch size) (Tang et al., 2021). On the other hand, the LAMB work shows that it is difficult to keep Adam's convergence speed at batch sizes 16K or larger for BERT pre-training (You et al., 2020). To find out whether 1-bit Adam is sufficient for large-batch distributed training, we perform a similar experiment using the BERT pre-training task at batch size 16K. Using You et al. (2020)'s training parameters (and tuning procedure) for LAMB and Adam, we perform BERT pre-training with LAMB and 1-bit Adam, respectively. (In Section 3.1, for both LAMB and 1-bit Adam, we use batch size 16K, 28125/3125 steps for seqlen 128/512, weight decay 0.01, and linear LR warmup and decay. For LAMB, we use 10% LR warmup and set the clipping thresholds in (2) to 0.1 and 1. All of these training parameters, except the LAMB clipping configuration, are from the LAMB paper.) For 1-bit Adam, following the original work's strategy, we set the number of warmup steps to 4000 (out of 28125 total steps) for seqlen 128 and 475 (out of 3125) for seqlen 512. Then we use the two pre-trained BERT models to fine-tune SQuAD 1.1 (detailed training parameters in Section 6.1), and Table 1 summarizes the results. Results show that, similar to what You et al. (2020) observe for LAMB and Adam, 1-bit Adam converges more slowly than LAMB at larger batch size.
| Optimizer for BERT pre-training | BERT pre-training validation loss | SQuAD Avg. F1 | SQuAD Max F1 |
| --- | --- | --- | --- |
| LAMB (You et al., 2020) | n/a | 91.345 | n/a |
| Adam (You et al., 2020) | n/a | 88.540 | n/a |
| 1-bit Adam (ours) | 1.504 | 89.225 | 89.494 |
3.2 Communication is still an overhead for large-batch distributed training
In the 1-bit Adam work, Tang et al. (2021) demonstrated that when pre-training BERT with vanilla Adam, the allreduce communication may take up to 94% and 75% of total training time per step on two clusters with Ethernet and InfiniBand networks, respectively. Since LAMB supports larger batch sizes for BERT pre-training, and larger batch sizes lead to fewer communication rounds, we want to investigate whether the communication overhead still exists. Thus we conduct a similar performance profiling experiment using the BERT-Large pre-training task (sequence length 128; detailed training parameters can be found in Section 6.1) but with the LAMB optimizer. We evaluate two kinds of clusters: one with 4 NVIDIA Tesla V100 GPUs per node (16GB GPU memory) and a 40 Gigabit Ethernet inter-node network (4.1 Gbps effective bandwidth based on the iperf benchmark); the other with 8 V100 GPUs per node (32GB GPU memory) and a 100 Gigabit InfiniBand EDR inter-node network (close to theoretical peak effective bandwidth based on microbenchmark).
As presented in Table 2, the profiling results show that even with larger batch sizes the allreduce communication still contributes a great portion of the training time per step: up to 91% and 52% on the two kinds of clusters, respectively. This overhead grows when the number of nodes is larger, when the batch size is smaller, and when the network bandwidth is lower. These results indicate the opportunity to improve large-batch distributed training efficiency via communication compression.
3.3 Investigating BERT pre-training with baseline LAMB: scaling coefficients become stable during training, but the variance is less stable
As discussed in the 1-bit Adam work (Tang et al., 2021), the variance term (v_t in (1)) makes Adam incompatible with error compensation because it is non-linearly dependent on the gradient; therefore, directly applying communication compression to Adam would greatly degrade the convergence speed. Since LAMB inherits Adam's optimization algorithm, it also has a variance term that depends non-linearly on the gradient. In addition, the calculation of LAMB's scaling coefficient (r_t^(l) in (2)) depends on the variance. Thus we also cannot directly apply communication compression to LAMB. Another key finding in the 1-bit Adam work is that Adam's variance term becomes stable at an early stage of training (after around 15% of total training for BERT), which is why 1-bit Adam can "freeze" the variance after the warmup stage and use it as a fixed precondition during 1-bit compression. In this section, we use the BERT pre-training task to investigate how LAMB's scaling coefficient and variance change during training.
Figure 1 presents LAMB's scaling coefficients for different layers during BERT pre-training at sequence length 128 (sequence length 512 has similar patterns). Results demonstrate that the scaling coefficients generally keep increasing until reaching the upper bound or a plateau, which is expected since with more learning the update tends to become smaller relative to the parameter. In addition, most of the scaling coefficients become stable at an early stage of training. Only cls.seq_relationship.bias has a very unstable scaling coefficient. This is because this bias only has two elements, representing the two states of whether the two sequences are next to each other. Results in this figure also demonstrate that LAMB provides adaptive learning rates in two ways: 1) different layers may reach scaling coefficient plateaus at different times; 2) different layers may reach different scaling coefficient plateaus. We believe that this is one of the reasons why LAMB can provide better convergence (or reach the same convergence with less hyperparameter tuning) at larger batch sizes compared with Adam.
Figure 2 presents LAMB's variance norms for different layers during BERT pre-training at sequence length 128 (sequence length 512 has similar patterns). Results demonstrate that LAMB's variance terms are less stable compared with Adam's. In the 1-bit Adam work, Tang et al. (2021) demonstrate that Adam's variance norm stays at the same magnitude after around 15% of total training. On the other hand, Figure 2 shows that for LAMB many layers have variance norms that keep decreasing during the whole training task, by up to two orders of magnitude. We believe there are two reasons: 1) LAMB has a larger batch size, a smaller number of steps, and layerwise adaptive learning rates compared to Adam. 2) Because LAMB requires calculating the scaling coefficient for each layer, we cannot fuse all the variances together as in Adam, and each separate variance can have a less stable norm compared to a single fused variance.
3.4 Existing compression method affects LAMB’s convergence
Based on our study of baseline LAMB, we want to find out whether it is possible to directly apply 1-bit Adam's two-stage approach in order to combine 1-bit compression's communication efficiency with LAMB's convergence speed at large batch sizes. In this section we design and evaluate an experimental two-stage algorithm, "LAMB + basic 1-bit": 1) At the warmup stage, vanilla LAMB is used and we keep track of a moving average of LAMB's scaling coefficients. 2) At the compression stage, we stop updating the variance and LAMB's scaling coefficients, use them as a fixed precondition, and communicate based on the 1-bit compressed momentum. (We also tried communicating based on the 1-bit compressed gradient, but it leads to much slower convergence; we believe this is because gradients are less stable, which leads to higher compression error.) For simplicity we only apply this experimental algorithm to the BERT pre-training seqlen 128 phase, and still use vanilla LAMB in the seqlen 512 phase.
Because LAMB's scaling coefficient is essentially part of the learning rate (and it is just a scalar), updating this coefficient would not affect communication compression's error compensation mechanism as long as the change between two steps is small enough. (In fact, as described in Section 4, the proposed 1-bit LAMB algorithm does update LAMB's scaling coefficient during the compression stage, but in a way different from the original LAMB algorithm.) However, we find it challenging to update this coefficient during the compression stage: in LAMB's algorithm, updating the scaling coefficient requires both the momentum and the variance term, while the error compensation mechanism requires freezing the variance term at the beginning of the compression stage due to its non-linear dependency on the gradient. In addition, we find that due to the error compensation mechanism, the norm of some layers' momentum can become larger or smaller compared to the uncompressed case. As a result, updating LAMB's scaling coefficients during the compression stage based on the original LAMB algorithm produces suboptimal results and slower convergence. Thus, for this section's experimental algorithm, we stop updating LAMB's scaling coefficients during the compression stage and simply use the scaling coefficient moving average computed at the end of the warmup stage.
Table 3 presents the validation loss at the end of BERT pre-training when using vanilla LAMB, the experimental algorithm described in this section, and the proposed 1-bit LAMB algorithm that we will describe in the upcoming Section 4. We also use the pre-trained BERT models to fine-tune SQuAD 1.1 and present the F1 scores. Results show that simply freezing both the variance and LAMB's scaling coefficients affects the convergence speed and leads to lower SQuAD scores (and lower GLUE scores in Section 6.2). Even if we increase the number of warmup steps, the convergence speed is still suboptimal. This is because LAMB's variance term is less stable, as demonstrated in our study. On the other hand, with the same warmup steps, the proposed 1-bit LAMB algorithm provides the same convergence speed as vanilla LAMB, as shown in Table 3. In the upcoming section we describe how 1-bit LAMB compensates for the effect of freezing the unstable variance by adaptively updating LAMB's scaling coefficient in a novel way during the compression stage.
| Optimizer for BERT pre-training | BERT pre-training validation loss | SQuAD Avg. F1 | SQuAD Max F1 |
| --- | --- | --- | --- |
| LAMB (You et al., 2020) | n/a | 90.584 | n/a |
| LAMB + basic 1-bit | 1.494 | 90.069 | 90.409 |
4 1-bit LAMB Algorithm
The proposed 1-bit LAMB optimizer introduces a novel way to update the adaptive layerwise learning rate during the compression stage. There are two major differences between 1-bit LAMB and the original LAMB:
During the compression stage, 1-bit LAMB updates the layerwise learning rate using a novel "reconstructed gradient" derived from the compressed momentum. This makes 1-bit LAMB compatible with error compensation and able to keep track of the training dynamics under compression.
1-bit LAMB also introduces extra stabilizing soft thresholds when updating the layerwise learning rate at the compression stage, which makes training more stable under compression.
We summarize 1-bit LAMB in Algorithm 1. The following subsections describe the contributions in terms of algorithm and implementation.
In this paper, we focus on the following optimization task and rely on the following notations and definitions:

min_{x∈R^d} f(x) = (1/n)·Σ_{i=1..n} E_{ζ∼D_i} F(x; ζ)

where d is the dimension of the model x, n is the number of workers, D_i is the data distribution of individual data samples on the i-th worker, and F(x; ζ) is the loss function.
Notations and definitions
Throughout this paper, we use the following notations:
- ∇f(x) denotes the gradient of a function f.
- ‖·‖_p denotes the p-norm for vectors and matrices; ‖·‖_∞ means the infinity norm.
- √x denotes the square root of the argument; if the argument is a vector, it returns the element-wise square root.
- x^2 denotes the element-wise square operation if x is a vector.
- x/y denotes the element-wise division operation if both x and y are vectors and their dimensions match.
4.1 Adaptively updating LAMB scaling coefficient during compression stage based on reconstructed “fresh” variance
During the compression stage, we have to freeze the variance in order to apply the error compensation mechanism correctly (Tang et al., 2021). However, this brings two challenges when combining LAMB with communication compression, as described in Section 3: 1) We cannot update LAMB's scaling coefficients (r_t^(l) in (2)) during the compression stage based on the LAMB algorithm, because that requires the uncompressed momentum and an up-to-date variance; 2) In vanilla LAMB the scaling coefficients become stable during training, but since we have to freeze the variance term and LAMB's variance is not stable for some layers, we have to adjust the scaling coefficients to compensate for these changes in the variance. Otherwise, as shown in Section 3.4, freezing both the variance and LAMB's scaling coefficients leads to slower convergence.
To this end, 1-bit LAMB uses a novel way to adaptively update the LAMB scaling coefficients during the compression stage to compensate for the difference between the frozen variance and the actual variance. During the warmup stage, vanilla LAMB is used and we also keep track of the moving average of each layer's scaling coefficient, which is used during the compression stage. We use this moving average instead of the exact coefficient at the end of warmup because, as shown in Section 3.3, the scaling coefficient is not stable at the early stage of training. At the end of warmup (Algorithm 1 line 3), we stop updating the scaling coefficient moving average and store the frozen variance to be used when updating the model during the compression stage (line 21). On the other hand, we still keep updating another "fresh" variance by reconstructing the global gradient based on this step's and last step's momentum (lines 15, 16). (One may ask why not communicate the compressed gradient so that the fresh variance can be computed directly; however, compressing gradients leads to higher compression error and slower convergence, as we found in Section 3.4.)
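Since the momentum update is m_t = β1·m_{t-1} + (1 - β1)·g_t, the global gradient can be recovered from two consecutive momentums as g_t = (m_t - β1·m_{t-1}) / (1 - β1). The fresh-variance update can then be sketched as below (a plausible reading of the lines referenced above; variable names are ours):

```python
# Sketch of the "fresh" variance update during the compression stage:
# reconstruct the global gradient from this step's and last step's
# (compressed) momentum, then feed it into the usual variance EMA.
def fresh_variance_update(m_t, m_prev, v_fresh, beta1=0.9, beta2=0.999):
    # Invert the momentum EMA to reconstruct the gradient.
    g = [(mt - beta1 * mp) / (1 - beta1) for mt, mp in zip(m_t, m_prev)]
    # Standard variance exponential moving average on the reconstruction.
    return [beta2 * vf + (1 - beta2) * gi * gi for vf, gi in zip(v_fresh, g)]
```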
To update the LAMB scaling coefficient during the compression stage, we compute a 1-bit LAMB scaling ratio, which is the max element of (frozen variance / fresh variance) (line 17). We use the max element as the scaling ratio because it is effective and cheap to compute based on our experiments. To avoid extreme ratios and dramatic changes between consecutive ratios, we use two kinds of clipping, configurable by the user (lines 18, 19). On the other hand, the clipping from the LAMB algorithm (the thresholds in (2)) is not used during the compression stage. Then we compute this step's LAMB scaling coefficient using the 1-bit LAMB scaling ratio and the moving average from the end of warmup, and use it to update the model (lines 20, 21).
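The coefficient update just described can be sketched as follows. Only the extreme-value clip is shown (the second clip, on the change between consecutive ratios, is omitted for brevity), and the threshold defaults are illustrative assumptions, not values from the paper:

```python
# Sketch of the 1-bit LAMB scaling-coefficient update: take the max
# element-wise ratio of frozen to fresh variance, clip it into a
# user-configurable range, and rescale the warmup-stage moving average
# of the layer's scaling coefficient.
def one_bit_lamb_coeff(frozen_var, fresh_var, coeff_avg,
                       ratio_min=0.5, ratio_max=2.0):
    ratio = max(fr / fv for fr, fv in zip(frozen_var, fresh_var))
    ratio = min(max(ratio, ratio_min), ratio_max)  # extreme-value clip
    return coeff_avg * ratio
```

Intuitively, when a layer's fresh variance keeps shrinking below the frozen one, the ratio exceeds 1 and the effective layerwise learning rate is scaled up to compensate.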
Figure 3 presents the 1-bit LAMB scaling ratios for different layers during BERT pre-training at sequence length 128 (sequence length 512 has similar patterns). Comparing with Figure 2, we find that for layers with less stable variance (e.g., the 3 kinds of embeddings and the weights in BertLayer), the corresponding 1-bit LAMB scaling ratios are also larger. As a result, 1-bit LAMB is able to adaptively update the LAMB scaling coefficients during compression according to the difference between the frozen and fresh variance.
4.2 Reducing number of communications by momentum fusion
Different from Adam, LAMB has scaling coefficients that need to be updated separately for each layer. When communicating the compressed momentum during 1-bit LAMB's compression stage, if we also communicate separately for each layer, the number of communications (302 for the BERT-Large model) would greatly affect overall performance. Thus we fuse all layers' momentum into a contiguous 1D buffer and perform just one communication over this fused momentum. This fused momentum is not a duplication, but a different "view" of all layers' momentum. We implement this momentum fusion with torch._utils._flatten_dense_tensors and torch._utils._unflatten_dense_tensors.
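The fusion idea can be illustrated without PyTorch: concatenate the per-layer buffers, run one collective call on the fused buffer, then split it back (the real implementation uses the torch._utils helpers named above, which additionally avoid copies by viewing contiguous storage):

```python
# Pure-Python sketch of momentum fusion: flatten per-layer momentums
# into one contiguous buffer so a single collective call suffices,
# then split the communicated result back into per-layer views.
def flatten(layers):
    sizes = [len(l) for l in layers]
    buf = [v for l in layers for v in l]
    return buf, sizes

def unflatten(buf, sizes):
    out, i = [], 0
    for s in sizes:
        out.append(buf[i:i + s])
        i += s
    return out
```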
4.3 Reducing compression error by momentum scaling
We find that after momentum fusion, the compression error increases a lot for some layers, which greatly increases the chance of divergence. This is because: 1) When communicating based on the fused momentum, all layers' momentum is compressed to the same scale due to the nature of 1-bit compression (representing the tensor by signs and a single scale). Thus layers with very small or large momentum scales have larger compression error; 2) Due to LAMB's layerwise adaptive learning rates, each layer learns at a different speed, which can further increase the variance of momentum scales across layers. We solve this issue by computing an average momentum scale at the end of the warmup stage and using it to compute a momentum scale coefficient for each layer. During the compression stage, we multiply each layer's local momentum by its scale coefficient, perform the compressed communication, then divide each layer's global momentum by the same scale coefficient. With this momentum scaling, all layers' momentum has a similar scale when passed to the compressed communication. Thus we are able to greatly reduce the compression error and the chance of divergence. Since the momentum scale coefficients are fixed during the compression stage, they do not affect 1-bit compression's error compensation mechanism.
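A minimal sketch of this scaling step, using mean absolute value as the per-layer "scale" (an assumption; the paper does not specify the exact scale statistic):

```python
# Sketch of momentum scaling around fused 1-bit compression: at the end
# of warmup, compute per-layer coefficients (average scale / layer scale)
# so that, after multiplication, all layers reach the compressor at a
# similar magnitude; the inverse division is applied after communication.
def momentum_scale_coeffs(layer_momentums):
    scales = [sum(abs(v) for v in m) / len(m) for m in layer_momentums]
    avg = sum(scales) / len(scales)
    return [avg / s if s > 0 else 1.0 for s in scales]

def scale_layers(layer_momentums, coeffs):
    return [[v * c for v in m] for m, c in zip(layer_momentums, coeffs)]
```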
5 Proposed System Design
To realize the compressed communication at the system level, in the 1-bit Adam work Tang et al. (2021) designed a custom collective primitive, called "compressed allreduce", using the Message Passing Interface (MPI), because the NCCL library (before v2.7) did not expose an Alltoall primitive or any point-to-point (send/recv) communication primitives that could be used to implement one. In addition to implementing 1-bit LAMB using an MPI-based backend similar to 1-bit Adam's, we introduce a new system implementation of compressed communication (applicable to both 1-bit LAMB and 1-bit Adam) using the NCCL backend of PyTorch distributed. This implementation follows the 3-phase design proposed in the 1-bit Adam work: 1) The gather step, where each worker sends its i-th chunk to worker i. We implement this step using NCCL's Alltoall (introduced in v2.7) and AllGather. 2) The average step, where each worker averages all chunks it receives. 3) The scatter step, where each worker receives the average of all i-th chunks from worker i. We implement this step using NCCL's AllGather.
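The data flow of the 3-phase design can be simulated in plain Python (compression is omitted here to keep the chunk routing clear; in the real backend each transferred chunk is 1-bit compressed with error compensation):

```python
# Simulation of the 3-phase compressed allreduce across n workers:
# phase 1 (gather/alltoall): chunk i of every worker's tensor is sent
#   to worker i;
# phase 2 (average): worker i averages the n copies of chunk i;
# phase 3 (scatter/allgather): averaged chunks are shared back so every
#   worker ends up with the full averaged tensor.
def compressed_allreduce(worker_tensors):
    n = len(worker_tensors)
    size = len(worker_tensors[0])
    chunk = size // n  # assume size divisible by n for this sketch
    averaged = []
    for i in range(n):  # phases 1 + 2, from worker i's perspective
        cols = [t[i * chunk:(i + 1) * chunk] for t in worker_tensors]
        averaged.append([sum(vals) / n for vals in zip(*cols)])
    # phase 3: every worker reassembles the averaged chunks
    return [v for c in averaged for v in c]
```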
Compared to the MPI-based implementation, this new NCCL-based implementation significantly improves the usability since NCCL is integrated with PyTorch distributed. In addition, our evaluations show that the performance of the NCCL-based implementation is better than the MPI-based implementation for Ethernet-based systems and on-par for InfiniBand-based systems. Thus in the evaluation section we focus on presenting the results with NCCL-based backend, but we also include a performance comparison between MPI and NCCL-based implementations.
6.1.1 Dataset and models
We evaluate the convergence and performance of 1-bit LAMB and uncompressed LAMB on the BERT-Large (L=24, H=1024, A=16, 340M params) pre-training task. We use the same dataset as Devlin et al. (2019), which is a concatenation of Wikipedia and BooksCorpus, with 2,500M and 800M words respectively. Compared to the original BERT model, one notable change is that we applied PreLN instead of PostLN for better training stability (Zhang and He, 2020; Xiong et al., 2020). We use the GLUE fine-tuning benchmark (Wang et al., 2018) and the SQuAD 1.1 fine-tuning task (https://rajpurkar.github.io/SQuAD-explorer/) to evaluate the convergence of the BERT models trained by LAMB and 1-bit LAMB.
We use the two clusters described in Section 3.2. We use 8 to 256 GPUs for BERT pre-training tasks to measure 1-bit LAMB's performance gain. For fine-tuning tasks we use 4 GPUs for each run. One thing to note is that because 1-bit LAMB introduces additional memory overhead (one persistent copy of the fresh variance and one temporary copy of last step's momentum), and because the V100 GPUs on the Ethernet cluster have 16GB memory instead of 32GB, we were not able to fit large batch sizes for BERT pre-training when the number of GPUs is small. For seqlen 128 and 512, we need at least 8 and 16 GPUs on the Ethernet cluster, respectively. On the other hand, since the benefit of compression depends on the communication overhead percentage, 1-bit LAMB provides more performance gain as the number of GPUs increases, as shown in the evaluation results.
6.1.3 Training parameters
For BERT pre-training, we set the parameters in (2) as , , and for LAMB and 1-bit LAMB. For 1-bit LAMB, we set , , , and in Algorithm 1. For convergence analysis, we set the total batch size to for seqlen 128 and for seqlen 512. For performance analysis, we test different batch sizes from to .
For BERT pre-training seqlen 128, the learning rate starts from , exponentially increases to as a warmup over the first 450 steps, then is multiplied by 0.9 after every 250 steps. The total number of steps is 5993. For 1-bit LAMB we use the first 1000 steps (16.7%) as the warmup stage. For BERT pre-training seqlen 512, the learning rate starts from , exponentially increases to as a warmup over the first 150 steps, then is multiplied by 0.9 after every 150 steps. The total number of steps is 555. For 1-bit LAMB we use the first 107 steps (19.3%) as the warmup stage.
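The schedule above can be sketched as follows. This is only an illustration: `lr_start` and `lr_peak` are placeholders for the learning-rate values elided above, and the exact warmup interpolation the paper used may differ from the geometric one assumed here.

```python
def lr_schedule(step, lr_start, lr_peak, warmup_steps, decay_interval, decay_rate=0.9):
    """Exponential warmup from lr_start to lr_peak, then multiply the
    learning rate by decay_rate after every decay_interval steps."""
    if step < warmup_steps:
        # Geometric interpolation between lr_start and lr_peak.
        return lr_start * (lr_peak / lr_start) ** (step / warmup_steps)
    num_decays = (step - warmup_steps) // decay_interval
    return lr_peak * decay_rate ** num_decays
```

For seqlen 128 this would be called with `warmup_steps=450` and `decay_interval=250`; for seqlen 512, with `warmup_steps=150` and `decay_interval=150`.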
For the GLUE benchmarks we use the Adam optimizer and perform single-task training on the dev set. Following the setup in the BERT paper (Devlin et al., 2019), we use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks. For each task, we select the best learning rate among the candidates. For SQuAD fine-tuning we use the Adam optimizer and the same parameters as published by HuggingFace (batch size = , learning rate = , dropout = , 2 epochs).
6.2 Convergence analysis
Figure 4 presents the BERT pre-training sample-wise convergence results. For both seqlen 128 and 512, 1-bit LAMB provides the same convergence speed as LAMB, while taking much less time to process each batch due to communication compression. We already presented the BERT pre-training validation loss and SQuAD fine-tuning results in Table 3 of Section 3.4; Table 4 presents the GLUE results. For both SQuAD and GLUE, the results show that 1-bit LAMB provides the same fine-tuning task accuracy as LAMB, while simply freezing both the variance and LAMB’s scaling coefficients hurts the accuracy.
| LAMB + basic 1-bit | 84.8/84.7 | 91.2 | 91.4 | 92.4 | 54.0 | 89.6 | 84.2 | 66.6 | 82.1 |
6.3 Performance analysis
For FP16 training, the end-to-end communication volume reduction can be computed as 1/(warmup_ratio + (1 - warmup_ratio)/16), since the compression stage sends 1-bit instead of 16-bit values. With this formula, 1-bit LAMB offers and end-to-end communication volume reductions for BERT-Large seqlen 128 and 512, respectively. To measure the actual end-to-end performance gain, we first perform a throughput analysis: we run the warmup stage (i.e., baseline LAMB’s performance) and the compression stage of 1-bit LAMB for 200 steps each, and measure the average throughput of each stage. Figure 5 presents the results with the NCCL-based compressed communication backend under different batch sizes and numbers of GPUs. For seqlen 128, 1-bit LAMB provides up to speedup during the compression stage, which is equivalent to end-to-end speedup (computed as 1/(warmup_ratio + (1 - warmup_ratio)/compression_stage_speedup)). For seqlen 512, 1-bit LAMB provides up to speedup during the compression stage, which is equivalent to end-to-end speedup. This demonstrates 1-bit LAMB’s better scalability compared to LAMB. It is also worth mentioning that 1-bit LAMB on Ethernet (4.1 Gbps effective bandwidth, 4 GPUs per node) achieves throughput comparable to LAMB on InfiniBand (near 100 Gbps effective bandwidth, 8 GPUs per node), which demonstrates 1-bit LAMB’s efficiency considering the hardware differences.
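The two formulas used above can be written out directly (a small helper sketch; `compression=16` reflects FP16 values being replaced by 1-bit values during the compression stage):

```python
def volume_reduction(warmup_ratio, compression=16):
    # Warmup steps send uncompressed FP16 gradients; compression-stage
    # steps send a payload 'compression' times smaller.
    return 1.0 / (warmup_ratio + (1.0 - warmup_ratio) / compression)

def end_to_end_speedup(warmup_ratio, compression_stage_speedup):
    # Warmup runs at baseline LAMB speed (speedup 1); the remaining
    # steps run faster by the measured compression-stage speedup.
    return 1.0 / (warmup_ratio + (1.0 - warmup_ratio) / compression_stage_speedup)
```

With the warmup ratios from Section 6.1.3, `volume_reduction(0.167)` evaluates to about 4.6 for seqlen 128 and `volume_reduction(0.193)` to about 4.1 for seqlen 512; a compression-stage speedup of 1 gives an end-to-end speedup of exactly 1, as expected.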
In addition to the throughput analysis, we also measure the total runtime of pre-training at batch size for both 1-bit LAMB and LAMB. As shown in Table 5, overall 1-bit LAMB is able to provide and speedups for seqlen 128 and 512, respectively. These numbers are consistent with the end-to-end speedups calculated in the throughput analysis ( and based on the results in Figures 5(d) and 5(g)). For seqlen 128, the end-to-end speedup based on runtime is slightly higher than the speedup based on throughput. We find that this is because uncompressed LAMB’s larger communication volume makes it more sensitive to occasional fluctuations of the actual network bandwidth.
All of the results above are evaluated with the NCCL-based compressed communication backend implementation. Figure 6 presents the performance comparison between the MPI- and NCCL-based implementations. Compared to the MPI-based implementation introduced in the 1-bit Adam work, our NCCL-based implementation provides better performance on the Ethernet cluster (where the OpenMPI library is used for the MPI backend) and on-par performance on the InfiniBand cluster (where MVAPICH2-GDR is used for the MPI backend).
| | Seqlen 128 | Seqlen 512 | Total |
|---|---|---|---|
| LAMB | 657 min | 74 min | 731 min |
| 1-bit LAMB | 301 min (2.2x) | 50 min (1.5x) | 351 min (2.1x) |
To reduce both the number and volume of communications for large-scale training, we propose an error-compensated LAMB-preconditioned momentum SGD algorithm, 1-bit LAMB, which combines the power of large batch optimization and communication compression by introducing a novel way to support adaptive layerwise learning rates during communication compression. We also introduce an easier-to-use and more efficient compressed communication backend based on NCCL. Evaluations show that 1-bit LAMB with the NCCL-based backend is able to achieve up to communication volume reduction and up to end-to-end speedup for BERT pre-training compared to uncompressed LAMB, with the same convergence speed and fine-tuning task accuracy.
- cpSGD: communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7564–7575. Cited by: §2.1.
- QSGD: Communication-Efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1709–1720. Cited by: §2.1.
- Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 14695–14706. Cited by: §2.1.
- SignSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations. Cited by: §2.1.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. Cited by: §1.
- Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1531–1539. Cited by: §2.1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. Cited by: §6.1.1, §6.1.3, Table 4.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (61), pp. 2121–2159. Cited by: §2.2.
- Linearly converging error compensated SGD. Cited by: §2.1.
- Communication-efficient distributed SGD with sketching. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 13144–13154. Cited by: §2.1.
- A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2530–2541. Cited by: §2.1.
- Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §2.2.
- Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations. Cited by: §2.1.
- Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8056–8067. Cited by: §2.1.
- Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5330–5340. Cited by: §2.1.
- Distributed SGD with flexible gradient compression. IEEE Access 8, pp. 64707–64717. Cited by: §2.1.
- 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014. Cited by: §2.1.
- Towards more efficient stochastic decentralized learning: faster convergence and sparse communication. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4624–4633. Cited by: §2.1.
- A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 2238–2247. Cited by: §2.1.
- Towards scalable distributed training of deep learning on public cloud clusters. In Proceedings of Machine Learning and Systems. Cited by: §2.1.
- Compressing gradient optimizers via Count-Sketches. Proceedings of the 36th International Conference on Machine Learning 97, pp. 5946–5955. Cited by: §2.1.
- Sparsified SGD with memory. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4447–4458. Cited by: §2.1.
- Communication-efficient distributed learning via lazily aggregated quantized gradients. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3370–3380. Cited by: §2.1.
- 1-bit Adam: communication efficient large-scale training with Adam’s convergence speed. Cited by: §1, §2.1, §2.2, §3.1, §3.2, §3.3, §4.1, §5.
- RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning. Cited by: §2.2.
- PowerSGD: practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 14259–14268. Cited by: §2.1.
- PowerGossip: practical low-rank communication compression in decentralized deep learning. Cited by: §2.1.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. Cited by: §6.1.1.
- Gradient sparsification for Communication-Efficient distributed optimization. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1299–1309. Cited by: §2.1.
- TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1509–1519. Cited by: §2.1.
- CSER: communication-efficient SGD with error reset. Cited by: §2.1.
- On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 10524–10533. Cited by: §6.1.1.
- Communication-Computation efficient gradient coding. Proceedings of the 35th International Conference on Machine Learning 80, pp. 5610–5619. Cited by: §2.1.
- Large batch optimization for deep learning: training BERT in 76 minutes. In International Conference on Learning Representations. Cited by: §1, §2.2, §3.1, Table 1, Table 3, footnote 4.
- Double quantization for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 4438–4449. Cited by: §2.1.
- ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. Cited by: §2.2.
- ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 4035–4043. Cited by: §2.1.
- Accelerating training of transformer-based language models with progressive layer dropping. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 14011–14023. Cited by: §6.1.1.
- Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 11450–11460. Cited by: §2.1.
- Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. CoRR abs/1609.08326. Cited by: §2.1.