DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
Large-scale model training has been the playing ground of a limited few, requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize the data movement to/from GPU and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFLOPS/GPU on a single NVIDIA V100 GPU for a 10B-parameter model, compared to 30 TFLOPS using PyTorch alone for a 1.4B-parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple GPUs when available, offering near-linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone. By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.
Since the advent of attention-based deep learning (DL) models in 2017, we have seen exponential growth in DL model size, fueled by the substantial quality gains that these attention-based models offer as the number of parameters increases. For example, the largest language model in the literature had less than 100M parameters in 2017; it grew to over 300M with BERT [devlin2018bert] in 2018, increasing to tens of billions in 2019 with models such as GPT-2 [cao2020pretrained], T5 [raffel2020exploring], Megatron-LM [MegatronLM] and Turing-NLG [turingnlg]. Today, the largest language model, GPT-3 [brown2020language], has a staggering 175B parameters. With the three orders of magnitude growth in model size since 2017, model accuracy continues to improve with model size [openaiscaling]. Recent studies in fact show that, for a given accuracy target, larger models are more resource-efficient to train than smaller ones [openaiscaling]. As a result, we expect model size to continue growing in the future.

However, access to large model training is severely limited by the nature of state-of-the-art system technologies, which make entry into the large model training space prohibitively expensive. To be more specific, distributed parallel DL training technologies such as pipeline parallelism [huang2018gpipe], model parallelism [MegatronLM], and ZeRO [zero] (Zero Redundancy Optimizer) transcend the memory boundaries of a single GPU/accelerator device by splitting the model states (parameters, gradients and optimizer states) across multiple GPU devices, enabling massive models that would otherwise simply not fit in a single GPU's memory. All record-breaking large models such as GPT-2, Megatron-LM, Turing-NLG, and GPT-3 were trained using a combination of the aforementioned technologies. However, all of these DL parallel technologies require enough GPU devices that the aggregated GPU memory can hold the model states required for the training. For example, training a 10B-parameter model efficiently requires a DGX-2-equivalent node with 16 NVIDIA V100 cards, whose cost is beyond the reach of many data scientists, and even many academic and industrial institutions.

Heterogeneous DL training is a promising approach to reduce GPU memory requirements by exploiting CPU memory.
Many efforts have been made in this direction [vDNN_micro16, SuperNeurons_ppopp18, Layrub, SwapAdvisor_asplos20, AutoSawp, AutoTM_asplos20, Capuchin_asplos20, hpca21:ren, isca20:buddy_compression]. Nearly all of them target CNN-based models, where activation memory is the memory bottleneck and the model size is fairly small (less than 500M). However, the primary memory bottleneck for recent attention-based large model training is the model states, rather than activation memory, and the literature lacks studies of these workloads for heterogeneous DL training. Additionally, existing efforts on heterogeneous training are limited in two major ways: i) nearly all of them exploit CPU memory, but not CPU compute, which we show can be used to significantly reduce the CPU-GPU communication overhead, and ii) they are mostly designed for and evaluated on a single GPU [Layrub, SwapAdvisor_asplos20, hpca21:ren, AutoSawp], without a clear path to scaling efficiently on multiple GPUs, which is crucial for large model training.

Addressing the aforementioned limitations, we attempt to democratize large model training by developing ZeRO-Offload, a novel heterogeneous DL training technology designed specifically for large model training. ZeRO-Offload exploits both CPU memory and compute for offloading, while offering a clear path toward efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism [zero]. Additionally, our first-principles analysis shows that ZeRO-Offload provides the only optimal solution for maximizing memory savings while minimizing communication overhead and CPU compute overhead for large model training.

ZeRO-Offload is designed around three main pillars: i) efficiency, ii) scalability, and iii) usability.

Efficiency: The offload strategy is designed with the goal of achieving compute efficiency comparable to state-of-the-art non-offload strategies, but for significantly larger models.
To achieve this goal, we rely on first-principles analysis to identify a unique computation and data partitioning strategy between CPU and GPU devices that is optimal in three key aspects: i) it requires orders-of-magnitude less computation on the CPU than on the GPU, preventing CPU compute from becoming a performance bottleneck, ii) it minimizes the communication volume between CPU and GPU, preventing communication from becoming a bottleneck, and iii) it provably maximizes memory savings on the GPU while achieving minimum communication volume. Our analysis shows that to be optimal in the aforementioned regards, we must offload the gradients, optimizer states and optimizer computation to the CPU, while keeping the parameters and the forward and backward computation on the GPU. This strategy enables a 10x increase in model size, with minimum communication and limited CPU computation, which allows us to train a 13B-parameter model on a single NVIDIA V100 GPU at 40 TFLOPS, compared to 30 TFLOPS on the same GPU with a 1.2B-parameter model, the largest model that can be trained without any CPU offloading.

Offloading the optimizer computation requires the CPU to perform O(M) computation, compared to O(MB) on the GPU, where M and B are the model size and batch size. In most cases, the batch size is large and CPU computation is not a bottleneck, but for small batch sizes, the CPU compute can be a bottleneck. We address this using two optimizations: i) an efficient CPU optimizer that is up to 6x faster than the state-of-the-art PyTorch implementation, and ii) one-step delayed parameter update, which overlaps the CPU optimizer step with GPU compute while preserving accuracy. Together, they preserve efficiency for ZeRO-Offload even with small batch sizes.

Scalability: Good scalability is crucial to take advantage of the multiple GPUs that may be available to some data scientists. In the DL community, data parallelism is generally used as the de facto standard to scale DL training to multiple GPUs [NIPS2012_4687, parallel_sgd, jmlr20_data_parallelism].
However, it is not designed to work with heterogeneous training and presents scalability challenges because of the replication of data and computation in data parallel training. Data parallel training replicates all the model states, such as optimizer states, parameters, and gradients, and it also replicates the optimizer computation on each GPU. Therefore, offloading model states or optimizer computation to CPU in combination with data parallelism results in significant replication of communication and CPU compute: the CPU memory requirement increases proportionally to the data parallelism degree, while throughput scalability is limited by the increased communication.

To address these challenges, ZeRO-Offload combines our offload strategy with ZeRO-powered data parallelism [zero] instead of traditional data parallelism. This symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states in CPU memory regardless of the data parallel degree. Furthermore, it keeps both the aggregate communication volume between GPU and CPU and the aggregate CPU computation constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree. As a result, ZeRO-Offload achieves excellent scalability on up to 128 GPUs. In addition to working with ZeRO-powered data parallelism, ZeRO-Offload can be combined with model parallelism [meshtensorflow, MegatronLM] to achieve higher memory savings when multiple GPUs are available.

Usability: ZeRO-Offload is available as part of an open-source PyTorch library, DeepSpeed (www.deepspeed.ai). Unlike most strategies discussed in Section 2, ZeRO-Offload does not require model refactoring to work. In fact, PyTorch users can enable ZeRO-Offload with a few lines of code change to their existing training pipeline, as shown in Figure 1, allowing them to easily train 10x larger models. For a detailed tutorial, please see: https://www.deepspeed.ai/tutorials/zerooffload/.

Contributions. Our contributions are as follows:
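To give a sense of what enabling ZeRO-Offload looks like in practice, the sketch below builds a DeepSpeed configuration dictionary with CPU offloading turned on. The key names follow DeepSpeed's public JSON config schema; the exact code snippet shown in Figure 1 may differ, and the batch size here is an arbitrary placeholder.

```python
# Hedged sketch of a DeepSpeed config enabling ZeRO-Offload
# (key names follow DeepSpeed's public config schema; values are placeholders).
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},  # mixed precision, as assumed by the offload strategy
    "zero_optimization": {
        "stage": 2,                              # ZeRO-2: partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states and compute on CPU
    },
}

# Typical usage (requires the `deepspeed` package, a model, and a training loop):
# import deepspeed
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

The point is that offloading is a configuration choice, not a model rewrite: the training pipeline itself is unchanged.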
Highly scalable multi-GPU design through i) a symbiotic combination of our offload strategy with ZeRO-powered data parallelism (Sec. 4.2), allowing ZeRO-Offload to achieve near-linear scalability, and ii) seamless integration with model-parallel training [MegatronLM], enabling even larger models than using ZeRO-Offload or model parallelism alone (Sec. 4.2).
Open-source implementation of ZeRO-Offload in PyTorch.
Extensive evaluation demonstrating i) model scale: a 10x increase in model size, with up to 13B parameters on a single GPU, and a 4x increase over model parallelism, with up to 70B parameters on a DGX-2 node; ii) efficiency: over 40 TFLOPS for a 10B-parameter model on a single NVIDIA V100, compared to 30 TFLOPS on the same GPU with a 1.2B-parameter model, the largest that can be trained without any CPU offloading; iii) scalability: near-perfect linear scalability for a 10B-parameter model on up to 128 GPUs; and iv) CPU overhead reduction: our ADAM implementation achieves a 6x speedup over the PyTorch optimizer and up to a 1.5x improvement in end-to-end throughput with the delayed parameter update optimization (Sec. 6).
Memory consumption in large model training.
The full spectrum of memory consumption during DNN model training can be classified into two parts: i) model states and ii) residual states [zero]. Model states include parameters, gradients, and optimizer states (such as momentum and variances in Adam [adam]); residual states include activations, temporary buffers, and unusable fragmented memory. Model states are the primary source of the memory bottleneck in large model training. We consider the memory consumption due to model states for large transformer models such as Megatron-LM (8 billion) [MegatronLM], T5 (11 billion) [raffel2020exploring], and Turing-NLG (17.2 billion) [turingnlg]. They are trained with float16 mixed precision training [mixed_percision_training] and the Adam optimizer [adam].

Mixed precision training often keeps two copies of the parameters, one in float16 (fp16) and the other in float32 (fp32). The gradients are stored in fp16. In addition to the parameters and gradients, the Adam optimizer keeps track of the momentum and variance of the gradients. These optimizer states are stored in fp32. Therefore, training a model in mixed precision with the Adam optimizer requires at least 2 bytes of memory for each fp16 parameter and gradient, and 4 bytes of memory each for the fp32 parameter, momentum and variance. In total, a model with M parameters requires 16M bytes of memory. Therefore, the model states for Megatron-LM, T5 and Turing-NLG require 128 GB, 176 GB and 284 GB, respectively, which are clearly beyond the memory capacity of even the current flagship NVIDIA A100 GPU with 80 GB of memory.

A significant amount of work has been done in recent years to enable large model training, which requires more memory than is available on a single GPU to fit these model and residual states. These efforts can be broadly classified into two categories: i) scale-out training and ii) scale-up training. We discuss them as follows.

Scale-out large model training. Scale-out training uses the aggregate memory of multiple GPUs to satisfy the memory requirement for large model training.
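The 16-bytes-per-parameter accounting above can be checked with a few lines of arithmetic (the per-state byte counts come directly from the text; the helper name is ours):

```python
# Per-parameter memory for fp16 mixed-precision training with Adam,
# following the accounting in the text.
BYTES = {
    "p16": 2,  # fp16 parameters
    "g16": 2,  # fp16 gradients
    "p32": 4,  # fp32 master parameters
    "m32": 4,  # fp32 momentum
    "v32": 4,  # fp32 variance
}
bytes_per_param = sum(BYTES.values())  # 16 bytes per parameter

def model_state_gb(num_params):
    """Memory (in GB) needed just for the model states."""
    return num_params * bytes_per_param / 1e9

print(bytes_per_param)       # 16
print(model_state_gb(8e9))   # 128.0  (Megatron-LM, 8B parameters)
print(model_state_gb(11e9))  # 176.0  (T5, 11B parameters)
```

This reproduces the 128 GB and 176 GB figures quoted above for Megatron-LM and T5.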
Two prominent examples of scale-out training are model parallelism [NIPS2012_4687, MegatronLM] and pipeline parallelism [huang2018gpipe, harlap2018pipedream], both of which partition the model states and the residual states across multiple GPUs. Model parallelism [NIPS2012_4687, MegatronLM] partitions the model vertically and distributes the model partitions to multiple GPU devices in order to train large models. Pipeline parallelism [huang2018gpipe, harlap2018pipedream], on the other hand, parallelizes model training by partitioning the model horizontally across layers. Both of these approaches must change the user model to work and can therefore limit usability.

A recent work, ZeRO [zero], provides an alternative to model and pipeline parallelism for training large models. ZeRO splits the training batch across multiple GPUs, similar to data parallel training [NIPS2012_4687, parallel_sgd, jmlr20_data_parallelism], but unlike data parallel training, which replicates all the model states on each GPU, ZeRO partitions them across all GPUs and uses communication collectives to gather individual parameters as needed during training. ZeRO does not require changes to the user model, making it more generic than model or pipeline parallel training. It also offers better compute efficiency and scalability.

Despite the ability of model parallelism, pipeline parallelism and ZeRO to train large models, they all require multiple GPUs such that the aggregate GPU memory can hold the model and residual states. In contrast, ZeRO-Offload is designed to fit a larger model by offloading model states to CPU memory, and can train a 10x larger model on a single GPU without sacrificing efficiency. When multiple GPUs are available, ZeRO-Offload is designed to work together with ZeRO to offer excellent scalability, or in conjunction with model parallelism to fit even larger model sizes than is possible with ZeRO-Offload or model parallelism alone.
Scale-up large model training. Existing work scales up model size on a single GPU through three major approaches. The first approach trades computation for memory, saving activation memory (residual memory) by recomputing from checkpoints [chen2016training]. The second approach uses compression techniques, such as low or mixed precision [mixed_percision_training], for model training, saving on both model states and activations. The third approach uses external memory, such as CPU memory, as an extension of GPU memory to increase memory capacity during training [vDNN_micro16, SuperNeurons_ppopp18, AutoTM_asplos20, SwapAdvisor_asplos20, Capuchin_asplos20, Layrub, hpca21:ren]. Our work, ZeRO-Offload, falls under the third approach. Unlike ZeRO-Offload, the above efforts only offload data to the CPU but not compute, and they target smaller models. A recent work called L2L [layertolayer] can enable multi-billion parameter training by managing memory usage in the GPU layer by layer. In particular, L2L synchronously moves the tensors needed in the upcoming layer into GPU memory for computation and keeps the remaining tensors in CPU memory to save memory. In comparison to ZeRO-Offload, it offers limited efficiency due to extra communication overhead, does not offer a way to scale out across devices, and requires model refactoring, making it difficult to use.
ZeRO-powered data parallel training. ZeRO-Offload works with ZeRO to scale DL training to multiple GPUs. ZeRO has three stages, ZeRO-1, ZeRO-2 and ZeRO-3, corresponding to the partitioning of the three different model states: optimizer states, gradients and parameters, respectively. ZeRO-1 partitions the optimizer states only, ZeRO-2 partitions gradients in addition to optimizer states, and ZeRO-3 partitions all model states. ZeRO-Offload works symbiotically with ZeRO-2, so we discuss it further here. In ZeRO-2, each GPU stores a replica of all the parameters, but only updates a mutually exclusive portion of them during the parameter update at the end of each training step. As each GPU only updates a portion of the parameters, it only stores the optimizer states and gradients required to make that update. After the update, each GPU sends its portion of the updated parameters to all the other GPUs using an all-gather communication collective. The ZeRO-2 computation and communication schedule is as follows: during the forward pass, each GPU computes the loss with respect to a different mini-batch. During the backward propagation, as each gradient is computed, it is averaged using a reduce operator at the GPU (or GPUs) that owns the gradient or part of the gradient. After the backward pass, each GPU updates its portion of the parameters and optimizer states using the averaged gradients corresponding to that portion. After this, an all-gather is performed to receive the rest of the parameter updates computed on the other GPUs.

ZeRO-Offload is designed to enable efficient large model training on a single GPU or multiple GPUs by offloading some of the model states from GPU to CPU memory during training. As discussed in Sec. 2, the model states (parameters, gradients, and optimizer states) are the primary source of the memory bottleneck in large model training.
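The ZeRO-2 update-and-gather schedule described above can be illustrated with a toy single-process simulation. Plain SGD stands in for Adam to keep the sketch short, and all names here are illustrative, not part of any real API.

```python
# Toy simulation of a ZeRO-2 parameter update: each "GPU" holds a full
# replica of the parameters but owns (and updates) only one shard.
def zero2_step(replicas, grads_per_gpu, lr=0.25):
    n = len(replicas)
    size = len(replicas[0])
    shard = size // n  # assume parameters divide evenly across GPUs

    # Reduce: average each gradient at the GPU that owns its shard.
    avg = [sum(g[i] for g in grads_per_gpu) / n for i in range(size)]

    # Each GPU updates only its own parameter shard (SGD as a stand-in).
    for rank, params in enumerate(replicas):
        lo, hi = rank * shard, (rank + 1) * shard
        for i in range(lo, hi):
            params[i] -= lr * avg[i]

    # All-gather: every GPU receives the shards updated by the others.
    gathered = [replicas[r][r * shard:(r + 1) * shard] for r in range(n)]
    full = [x for part in gathered for x in part]
    for params in replicas:
        params[:] = full
    return replicas

replicas = [[1.0, 1.0, 1.0, 1.0] for _ in range(2)]   # 2 GPUs, 4 params
grads = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]  # per-GPU minibatch grads
zero2_step(replicas, grads)  # avg grad = 2.0, so each param becomes 0.5
```

After the all-gather, every replica again holds the complete, identically updated parameter set, even though each rank computed only its own slice of the update.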
By offloading some of these model states to CPU, ZeRO-Offload can enable training of significantly larger models. (ZeRO-Offload only offloads model states. Offloading secondary sources of memory bottleneck, such as activation memory, is beyond the scope of our offload strategy; given that they are significantly smaller than the model states, we ignore them for the purpose of our analysis. Furthermore, the first and second approaches described in Sec. 2 can be used in conjunction with ZeRO-Offload to reduce activation memory.) However, identifying the optimal offloading strategy is non-trivial. There are numerous ways to offload model states to CPU memory, each with a different trade-off in terms of CPU computation and GPU-CPU communication, both of which can limit training efficiency.

To identify the optimal offload strategy, ZeRO-Offload models DL training as a data-flow graph and uses first-principles analysis to efficiently partition this graph between CPU and GPU devices. ZeRO-Offload partitions the graph in a way that is optimal in three key aspects: i) it requires orders-of-magnitude less computation on the CPU than on the GPU, preventing the CPU from becoming a performance bottleneck, ii) it guarantees the minimization of communication volume between CPU and GPU memory, and iii) it provably maximizes memory savings while achieving minimum communication volume. In fact, ZeRO-Offload can achieve training efficiency comparable to non-offload training, and it is uniquely optimal: no other solution can offer better memory savings without increasing the communication volume or the CPU computation. In this section, we discuss the derivation of our unique optimal offload strategy. Our strategy is designed specifically for mixed precision training with the Adam optimizer, the de facto training recipe for large model training.
The DL training workload can be represented as a weighted directed graph of data and computation, as shown in Figure 2, where the circular nodes represent model states (parameter16, gradient16, parameter32, momentum32, variance32) and the rectangular nodes represent computation (forward, backward, param update). The edges in the graph represent the data flow between the nodes, and the weight of an edge is the total data volume in bytes that flows through it during a training iteration. For a model with M parameters, the weight of an edge is either 2M, where the source node produces fp16 model states, or 4M, where the source node produces fp32 model states. An offload strategy between GPU and CPU can be represented as a two-way partitioning of this graph: the computation nodes in a partition are executed on the device that owns the partition, and the data nodes in a partition are stored on the device that owns the partition. The total data volume that must be communicated between GPU and CPU is given by the weight of the edges running across the two partitions. There are numerous ways to partition this graph. In the following sections, we use first principles to simplify the data-flow graph, reducing the number of possible choices based on three efficiency metrics: i) CPU computation overhead, ii) communication overhead, and iii) memory savings.
The CPU computation throughput is multiple orders of magnitude slower than the GPU computation throughput. Therefore, offloading a large computation graph to the CPU would severely limit training efficiency, so we must avoid offloading compute-intensive components to the CPU. The compute complexity of DL training per iteration is generally O(MB), where M is the model size and B is the effective batch size. To keep CPU computation from becoming a bottleneck, only computations with a compute complexity lower than O(MB) should be offloaded to the CPU. This means that forward propagation and backward propagation, both of which have a compute complexity of O(MB), must be done on the GPU, while the remaining computations, such as norm calculations and weight updates, which have a complexity of O(M), may be offloaded to the CPU. Based on this observation, we fuse the forward and backward nodes in our data-flow graph into a single super-node (FWD-BWD) and assign it to the GPU.
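A back-of-the-envelope calculation illustrates the gap between the O(MB) and O(M) terms. The constant factors below are illustrative assumptions (6 ops per parameter per token is a common rule of thumb for forward plus backward), not measurements from the paper.

```python
# Back-of-envelope check of the O(MB) vs O(M) argument.
M = 10e9      # model parameters (10B, as in the text)
B = 16 * 512  # effective batch size in tokens (assumed)

fwd_bwd_ops = 6 * M * B  # forward+backward: ~6 ops/parameter/token (rule of thumb)
update_ops = 10 * M      # optimizer step: a few ops per parameter, independent of B

ratio = fwd_bwd_ops / update_ops  # = 0.6 * B, independent of M
print(ratio)  # thousands: the optimizer step is a tiny fraction of the work
```

Because the ratio grows with B and is independent of M, the O(M) optimizer step stays cheap relative to the O(MB) forward/backward work for any reasonably large batch.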
The CPU memory bandwidth is at least an order of magnitude faster than the PCIe bandwidth between CPU and GPU, while the GPU memory is another order of magnitude faster than even the CPU memory. Therefore, we must minimize the communication volume between CPU and GPU memory to prevent the PCIe bandwidth from becoming a training performance bottleneck. To do so, we must first identify the theoretical minimum communication volume for a model-state offload strategy, which is 4M. (Note that it is possible to reduce the communication volume further by only offloading partial model states. For simplification, we assume that offloading a model state implies offloading the entire model state; our analysis of memory savings per communication volume still holds for partial offloads.) Note that after fusing the forward and backward computation into a single super-node, as discussed in Sec. 3.2, each node in our data-flow graph is part of a cycle. Therefore, any partitioning of this graph requires cutting at least two edges, each of which has an edge weight of at least 2M, resulting in a total communication volume of at least 4M.

If we choose to limit the communication volume to this bare minimum, we can greatly simplify our data-flow graph and reduce the number of partitioning strategies to a handful.

Creating the fp32 super-node: Notice that any partitioning strategy that does not colocate the fp32 model states with their producer and consumer nodes cannot achieve the minimum communication volume of 4M. Such a partition must cut at least one edge with a weight of 4M and another with a weight of at least 2M, resulting in a communication volume of at least 6M.
Therefore, to achieve the minimum communication volume, all offload strategies must colocate the fp32 model states with their producer and consumer operators, i.e., the fp32 model states (momentum32, variance32 and p32) must be colocated with the Param Update and the float2half computations. This constraint allows us to treat all of the aforementioned fp32 data and compute nodes in the data-flow graph as a single super-node that we refer to as Update Super. We show this reduced data-flow graph in Figure 2, consisting of only four nodes: the FWD-BWD Super node, the p16 data node, the g16 data node, and the Update Super node.

p16 assignment: To achieve the minimum communication volume, p16 must be colocated with FWD-BWD Super, because the edge weight between these two nodes is 4M. Separating these two nodes would increase the communication volume to at least 6M. Since we have already assigned FWD-BWD Super to the GPU to limit computation on the CPU, p16 must also be assigned to the GPU.
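The edge-weight argument can be checked mechanically by enumerating all two-way cuts of the simplified four-node cycle. This is a small sketch: node names are ours, and the edge weights follow the analysis above (p16 feeds both forward and backward, hence a 4M edge after fusing).

```python
# Enumerate two-way partitions of the simplified data-flow cycle
# FWD-BWD -> g16 -> Update -> p16 -> FWD-BWD and compute the CPU-GPU
# traffic of each, in multiples of M.
from itertools import product

EDGES = {  # (src, dst): weight in units of M
    ("fwd_bwd", "g16"): 2,
    ("g16", "update"): 2,
    ("update", "p16"): 2,
    ("p16", "fwd_bwd"): 4,
}
NODES = ["fwd_bwd", "g16", "update", "p16"]

def comm_volume(cpu_set):
    # Traffic = total weight of edges crossing the CPU/GPU boundary.
    return sum(w for (a, b), w in EDGES.items()
               if (a in cpu_set) != (b in cpu_set))

# FWD-BWD must stay on the GPU (O(MB) compute); try placing the rest anywhere.
volumes = {}
for bits in product([0, 1], repeat=3):
    cpu = {n for n, b in zip(NODES[1:], bits) if b}
    volumes[frozenset(cpu)] = comm_volume(cpu)

best = min(v for k, v in volumes.items() if k)  # ignore the all-GPU case
print(best)  # 4, i.e. the 4M minimum derived above
```

The enumeration confirms both claims: the minimum is 4M, and any partition that moves p16 to the CPU (cutting the 4M edge) pays at least 6M.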
After simplifying the data-flow graph to minimize communication volume, only g16 and Update Super remain to be assigned. Notice that at this point all remaining partitions result in the minimum communication volume, so we can prune the choices further to maximize the memory savings on the GPU. Table 1 shows the memory savings of all valid partitioning strategies that minimize the communication volume. The maximum memory saving of 8x is achieved by offloading both g16 and Update Super to the CPU.
FWD-BWD   p16   g16   Update   Memory   Reduction
gpu       gpu   gpu   gpu      16M      1x (baseline)
gpu       gpu   cpu   gpu      14M      1.14x
gpu       gpu   gpu   cpu      4M       4x
gpu       gpu   cpu   cpu      2M       8x
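The Memory column of Table 1 can be reproduced directly from the per-state byte counts established earlier: p16 = 2, g16 = 2, and the Update super-node's fp32 states p32 + m32 + v32 = 12 bytes per parameter (helper names are ours):

```python
# Reproduce the "Memory" column of Table 1 from per-state sizes
# (bytes per parameter held on the GPU).
SIZES = {"p16": 2, "g16": 2, "update": 12}

def gpu_memory(offloaded):
    """GPU bytes per parameter when the `offloaded` states live on the CPU."""
    return sum(v for k, v in SIZES.items() if k not in offloaded)

baseline = gpu_memory(set())  # 16M: everything on GPU
rows = {
    "all gpu": gpu_memory(set()),                     # 16M, 1x
    "g16 cpu": gpu_memory({"g16"}),                   # 14M, 1.14x
    "update cpu": gpu_memory({"update"}),             # 4M,  4x
    "g16+update cpu": gpu_memory({"g16", "update"}),  # 2M,  8x
}
for name, mem in rows.items():
    print(name, f"{mem}M", f"{baseline / mem:.2f}x")
```

Offloading both g16 and Update Super leaves only the 2-byte fp16 parameters on the GPU, which is exactly the 8x reduction in the last row.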
ZeRO-Offload allocates all the fp32 model states, along with the fp16 gradients, in CPU memory, and it also computes the parameter updates on the CPU. The fp16 parameters are kept on the GPU, and the forward and backward computations are also done on the GPU. We arrive at this offload strategy by simplifying our data-flow graph and eliminating all other partitioning strategies, as they fail to limit CPU computation, minimize communication volume, or maximize memory savings. ZeRO-Offload is therefore not only optimal in terms of the aforementioned metrics, it is also unique: no other strategy can offer more memory savings than ZeRO-Offload without increasing the compute complexity on the CPU or incurring additional GPU-CPU communication volume.
In this section, we discuss the concrete computation and communication schedule for implementing ZeRO-Offload on a single-GPU system based on our offload strategy. We then show how we extend this schedule to work effectively on multi-GPU systems by combining our offload strategy with ZeRO data parallelism and model parallelism.
As discussed in Sec. 3, ZeRO-Offload partitions the data such that the fp16 parameters are stored on the GPU, while the fp16 gradients and all the optimizer states, such as the fp32 momentum, variance and parameters, are stored on the CPU.

During training, we begin by computing the loss via the forward propagation. Since the fp16 parameters are already present on the GPU, no CPU-GPU communication is required for this part of the computation. During the backward propagation, the gradients for different parameters are computed at different points in the backward schedule. ZeRO-Offload can transfer these gradients, individually or in small groups, to CPU memory immediately after they are computed. Therefore, only a small amount of GPU memory is required to temporarily hold the gradients before they are transferred to CPU memory. Furthermore, each gradient transfer can be overlapped with the back-propagation on the remainder of the backward graph, allowing ZeRO-Offload to hide a significant portion of the communication cost. After the backward propagation, ZeRO-Offload updates the fp32 parameters and the remaining optimizer states (such as momentum and variance) directly on the CPU, and copies the updated fp32 parameters from CPU memory to the fp16 parameters in GPU memory. Figure 3 shows the computation and communication in each step of ZeRO-Offload diagrammatically, and Figure 5 shows the concrete schedule as pseudo-code.

ZeRO-Offload in its entirety is a symbiotic integration of the offload strategy described in Sec. 3 and ZeRO-powered data parallelism discussed in Sec. 2, which allows ZeRO-Offload to scale to hundreds of GPUs efficiently. ZeRO-Offload preserves the model-state partitioning strategy of ZeRO Stage-2 (optimizer state and gradient partitioning), while offloading the partitioned gradients, optimizer states and the corresponding parameter updates to the CPU. The key benefit of partitioning before offloading is that, for systems with more than one GPU, each data parallel process is only responsible for updating a subset of the parameters. The aggregated communication volume from all the data parallel GPUs to the CPU remains constant, and CPU resources are used in parallel to jointly compute a single weight update. As a result, the total CPU update time decreases with increased data parallelism, since the CPU compute resources increase linearly with the number of compute nodes. This allows ZeRO-Offload to achieve very good scalability, as the overhead of communication across GPUs is offset by the reduction in the CPU optimizer step time.

ZeRO-Offload partitions gradients and optimizer states among the GPUs, and each GPU offloads the partition it owns to CPU memory, where it stays for the entire training. During the backward propagation, gradients are computed and averaged using reduce-scatter on the GPU, and each GPU offloads only the averaged gradients belonging to its partition to CPU memory. Once the gradients are available on the CPU, the optimizer-state partitions are updated in parallel by each data parallel process directly on the CPU. After the update, the parameter partitions are moved back to the GPU, followed by an all-gather operation on the GPU, similar to ZeRO-2, to gather all the parameters.
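The single-GPU schedule described earlier can be sketched as a sequential event log. In the real system the per-layer gradient transfers overlap with backward compute via CUDA streams; here we only record the ordering, and all event names are illustrative.

```python
# Minimal sequential sketch of the single-GPU ZeRO-Offload schedule:
# gradients stream to the CPU as backward produces them, the optimizer
# step runs on the CPU, and updated fp16 parameters are copied back.
log = []

def train_step(num_layers=3):
    log.append("forward(gpu)")
    for layer in reversed(range(num_layers)):
        log.append(f"backward_layer_{layer}(gpu)")
        # Per-layer transfer: in the real system this copy runs on a
        # separate stream, overlapped with the rest of backward.
        log.append(f"offload_grad_{layer}(gpu->cpu)")
    log.append("adam_update(cpu)")       # fp32 states never leave the CPU
    log.append("copy_params(cpu->gpu)")  # refresh the fp16 parameters

train_step()
```

The log shows the key property of the schedule: the gradient of the last layer is already on its way to the CPU before the backward pass of earlier layers has even started, which is what lets the transfer cost hide behind compute.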
Figure 4 shows the data placement of model parameters, gradients and optimizer states for ZeRO-Offload, and the details of the ZeRO-Offload data parallel schedule are presented in Figure 5. The all-gather operation described above is shown as a sequence of broadcast operations in the figure.

Model-parallel training. ZeRO-Offload can also work together with tensor-slicing based model parallelism (MP) frameworks such as Megatron-LM [MegatronLM]. It does so by offloading the gradients, optimizer states and the optimizer computation corresponding to each MP process, allowing ZeRO-Offload to train significantly larger models than is possible using model parallelism alone. Sec. 6 provides more details.
We speed up the CPU execution time for the parameter updates with two optimizations. First, we implement a fast CPU Adam optimizer using high-performance computing techniques, offering significant speedup over the state-of-the-art PyTorch implementation. Second, we develop a one-step delayed parameter update schedule that overlaps the CPU parameter update computation with the forward and backward computation on the GPU, hiding the CPU execution time when enabled.
We use three levels of parallelism to improve the performance of the CPU optimizer: 1) SIMD vector instructions [SIMDvector] to fully exploit the hardware parallelism supported on CPU architectures; 2) loop unrolling [loopunrolling], an effective technique for increasing instruction-level parallelism that is crucial for better memory bandwidth utilization; and 3) OMP multithreading for effective utilization of multiple cores and threads on the CPU in parallel. Using these techniques, we present a significantly faster implementation of the Adam optimizer compared to the state-of-the-art PyTorch implementation.

ADAM is an optimization algorithm used for deep-learning training, which takes the loss gradients together with their first and second moments to update the parameters. Therefore, in addition to the model parameters, ADAM requires two more matrices of the same size (M) to be saved during training. In the mixed precision training mode, there are two versions of the parameters stored in memory: one in FP16 (p16), used for computing the activations in the forward pass (on the GPU), and one master copy in FP32 (p32), which is updated by the optimizer (on the CPU). The p16 is updated from the p32 through casting at each training step. Moreover, the momentum and variance of the gradients are saved in FP32 (on the CPU) to prevent precision loss when updating the parameters. Please refer to [adam] for further details on ADAM's algorithm.
Optimized implementation. Algorithm 2 elaborates the implementation details of Adam using SIMD operations. As shown, the Adam function receives the optimizer hyperparameters, such as the learning rate, β1, β2, and ε, together with the gradient, momentum, variance, and master copy of the parameters (p32) as input. We also use some implementation-specific parameters, such as the SIMD width and the unroll width. The Adam optimizer sends back the updated variance, momentum, and parameters, in both FP16 (to GPU) and FP32 (on CPU). We first read the data, including parameters, gradients, momentum, and variance, into the vector registers (line 7). Then, we use several fused multiply-add (FMA) vector operations to perform the main execution pipeline, which is repeated by the unrolling width. Note that the rest of the operations, such as multiply, division, and sqrt, also run in vector mode. For the best performance we use the AVX512 SIMD instruction set and an unroll width of 8, based on auto-tuning results. In addition to the CPU-Adam optimizer, we implement the CPU-to-GPU FP16 parameter copy in a tiled manner (line 15). We overlap the CPU and GPU execution by parallelizing the Adam computation and the copying of the parameters to the GPU: as we process the Adam computation for the current tile of data on the CPU, we write the parameters of the previously processed tile back to the GPU. This way, we reduce the GPU idle time before the processing of the next training step begins.
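The tiled overlap scheme can be sketched as follows. This is an illustrative Python skeleton under stated assumptions: `compute_tile` and `copy_tile_to_gpu` are hypothetical placeholders for the real per-tile Adam kernel and the CPU-to-GPU transfer, and a background thread stands in for an asynchronous CUDA copy:

```python
from concurrent.futures import ThreadPoolExecutor

def tiled_update_and_copy(num_tiles, compute_tile, copy_tile_to_gpu):
    """Run the per-tile CPU Adam update while the previous tile's updated
    FP16 parameters are copied to the GPU in the background.

    compute_tile(i):     CPU Adam update for tile i (placeholder).
    copy_tile_to_gpu(i): transfer tile i's updated params to the GPU (placeholder).
    """
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = None
        for i in range(num_tiles):
            compute_tile(i)            # Adam computation for the current tile
            if pending is not None:
                pending.result()       # ensure the previous tile's copy finished
            pending = copier.submit(copy_tile_to_gpu, i)   # overlaps with tile i+1
        if pending is not None:
            pending.result()           # drain the final copy
```

The copy of tile i thus runs concurrently with the Adam computation of tile i+1, so the GPU receives updated parameters incrementally instead of waiting for the whole optimizer step to finish.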
Despite using a highly optimized CPU optimizer, the CPU computation overhead can become a bottleneck during training with very small batch sizes, when the GPU computation time is not much larger than the CPU compute time. For such cases, we develop one-step delayed parameter update (DPU), which overlaps CPU and GPU compute to hide the CPU computation overhead by delaying the parameter update by a single step. We verify in the evaluation that DPU does not impact the final accuracy of training. DPU training schedule. Figure 6 shows the workflow of the ZeRO-Offload training process with delayed parameter update. ➊ The first N−1 steps are trained without DPU, to avoid destabilizing the training during the early stages where gradients change rapidly. ➋ On step N, we obtain the gradients from the GPU, but we skip the CPU optimizer step, and do not update the FP16 parameters on the GPU either. ➌ At step N+1, we compute the parameter updates on the CPU using the gradients from step N, while computing the forward and backward pass on the GPU in parallel using the parameters updated at step N−1. From this step onwards, the model at step i+1 is trained using parameters updated with gradients from step i−1 instead of step i, overlapping CPU compute with GPU compute. Accuracy trade-off. Since DPU changes the semantics of the training, it is reasonable to ask whether there is a trade-off between model accuracy and training efficiency. To answer this question, we evaluated DPU on multiple training workloads and found that DPU does not hurt convergence if we introduce it after a few dozen iterations instead of at the beginning. Our evaluation results in Sec. 6 show that, compared with training with ZeRO-Offload only, training with delayed parameter update achieves the same model training accuracy with higher training throughput.
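The delayed schedule can be sketched as a training loop. This is illustrative Python under stated assumptions: `gpu_step` and `cpu_optimizer_step` are hypothetical stand-ins for the GPU forward/backward pass and the CPU Adam update, and a plain thread models the overlap:

```python
from threading import Thread

def train_with_dpu(num_steps, warmup_steps, gpu_step, cpu_optimizer_step):
    """One-step delayed parameter update (DPU) schedule.

    gpu_step(t):           forward + backward on the GPU; returns gradients.
    cpu_optimizer_step(g): Adam update on the CPU using gradients g.
    The first warmup_steps updates are synchronous so early training stays
    stable; afterwards the update for step t overlaps the GPU work of step t+1.
    """
    delayed_grads = None
    for t in range(num_steps):
        if t < warmup_steps:                 # synchronous warm-up steps
            cpu_optimizer_step(gpu_step(t))
        else:
            updater = None
            if delayed_grads is not None:    # apply the previous step's gradients...
                updater = Thread(target=cpu_optimizer_step, args=(delayed_grads,))
                updater.start()
            delayed_grads = gpu_step(t)      # ...while the GPU runs the current step
            if updater is not None:
                updater.join()
    if delayed_grads is not None:
        cpu_optimizer_step(delayed_grads)    # flush the last pending update
```

Note how the first DPU step produces gradients but applies no update, matching step ➋ of the schedule; from then on the CPU optimizer always consumes gradients that are one step stale.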
This section seeks to answer the following questions, in comparison to the state-of-the-art:
How does ZeROOffload scale the trainable model size compared to existing multibillion parameter training solutions on a single GPU/DGX2 node?
What is the training throughput of ZeROOffload on single GPU/DGX2 node?
How does the throughput of ZeROOffload scale on up to 128 GPUs?
What is the impact of our CPU-Adam and delayed parameter update (DPU) on improving throughput, and does DPU change model convergence?
DGX-2 node
GPU: 16 NVIDIA Tesla V100 Tensor Core GPUs
GPU memory: 32 GB HBM2 on each GPU
CPU: 2 Intel Xeon Platinum 8168 processors
CPU memory: 1.5 TB 2666 MHz DDR4
CPU cache: L1, L2, and L3 of 32 KB, 1 MB, and 33 MB, respectively
PCIe: 32 GB/s bidirectional
Testbed. For the evaluation of model scale and throughput, we conduct our experiments on a single DGX-2 node, whose details are shown in Table 2. For the evaluation of throughput scalability, we conduct experiments on 8 NVIDIA DGX-2 nodes connected with InfiniBand through a 648-port Mellanox MLNX-OS CS7500 switch. Workloads. For the performance evaluation, we focus on evaluating GPT-2 [gpt2]-like Transformer-based models [transformer]. We vary the hidden dimension and the number of Transformer blocks to obtain models with different numbers of parameters. Note that scaling the depth alone is often not sufficient, because it would make training more difficult [openaiscaling]. Table 3 shows the configuration parameters used in our experiments. For the convergence analysis, such as that of delayed parameter update, we use GPT-2 [gpt2] and BERT [devlin2018bert]
, both of which are commonly used as pretrained language models and have demonstrated superior performance in many NLP tasks (e.g., natural language understanding and inference) compared to recurrent or convolutional neural networks. We use BERT-large, the same as in [devlin2018bert], which has 24 layers, a hidden size of 1024, 16 attention heads, and 336M parameters. Similar to [zero, MegatronLM], we fine-tune BERT on the Stanford Question Answering Dataset (SQuAD) [SQuDAranklist], one of the most widely used reading comprehension benchmarks [squda]. Unless otherwise stated, we follow the same training procedure and hyperparameter settings as in [devlin2018bert, gpt2]. Baseline. We compare the effectiveness of ZeRO-Offload with state-of-the-art multi-billion parameter training solutions: PyTorch DDP: the existing PyTorch Transformer implementation using DistributedDataParallel [pytorchddp].
Megatron [MegatronLM]: one of the current state-of-the-art multi-billion parameter model training solutions, which employs model parallelism to train models of up to 8.3B parameters using 512 GPUs.
L2L [layertolayer]: L2L enables training of deep Transformer networks by keeping only one Transformer block at a time in GPU memory, moving the tensors of the upcoming Transformer block into GPU memory only when needed.
ZeRO [zero]: ZeRO extends data parallelism by eliminating memory redundancies across multiple GPUs, allowing models of up to 170B parameters to be trained with high throughput using 25 DGX-2 nodes. We refer to the open-sourced implementation of ZeRO as ZeRO-2. ZeRO-2 achieves state-of-the-art results for large model training and is a strong baseline.
# params           | micro-batch size | MP degree | # layers   | hidden size
1, 2 billion       | 32               | 1         | 20, 40     | 2048
4 billion          | 32               | 1         | 64         | 2304
6, 8 billion       | 16               | 1         | 53, 72     | 3072
10, 11 billion     | 10, 8            | 1         | 50, 55     | 4096
12, 13 billion     | 4                | 1         | 60, 65     | 4096
15 billion         | 8                | 2         | 78         | 4096
20, 40, 60 billion | 8                | 2         | 25, 50, 75 | 8192
70 billion         | 8                | 8         | 69         | 9216
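The model sizes in Table 3 follow from the standard Transformer parameter count, which is dominated by roughly 12·h² weights per layer (the attention projections plus the 4h feed-forward block). A quick sanity check with a hypothetical helper (embeddings and biases ignored, so this is only an approximation):

```python
def approx_transformer_params(num_layers: int, hidden: int) -> int:
    """Approximate parameter count of a GPT-2-like Transformer.

    Each layer holds ~4*h^2 attention weights (Q, K, V, and output
    projections) and ~8*h^2 feed-forward weights (h -> 4h -> h),
    i.e. ~12*h^2 per layer. Embeddings and biases are ignored.
    """
    return 12 * num_layers * hidden ** 2

# First row of Table 3: 20 layers at hidden size 2048 is ~1B parameters;
# 64 layers at hidden size 2304 is ~4B.
print(approx_transformer_params(20, 2048) / 1e9)
print(approx_transformer_params(64, 2304) / 1e9)
```

This is why the table scales both depth and hidden size together: the quadratic dependence on h lets hidden-dimension growth carry most of the parameter increase, consistent with the observation that scaling depth alone makes training harder.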
As an important step toward democratizing large model training, we first test the largest trainable models on a single GPU as well as on the 16 GPUs of a single DGX-2 node.
The largest model that can be trained using PyTorch DDP on a single GPU with 32 GB of memory is 1.4B parameters, before running out of memory, as shown in Figure 9. Neither Megatron nor ZeRO-2 increases the trainable model size on a single GPU compared to PyTorch, because they both rely on aggregated GPU memory to fit larger models. In contrast, ZeRO-Offload enables 13B-parameter model training on a single GPU, more than 9x larger than with PyTorch, Megatron, or ZeRO-2. This is mainly because of ZeRO-Offload's strategy of maximizing memory savings on the GPU by offloading expensive states, such as the optimizer states and the majority of the gradients, to CPU memory. On the other hand, L2L is able to train even larger models (e.g., 17B) on a single GPU by frequently moving the weights of unused layers to CPU memory. However, the largest model size does not increase when training with L2L on multiple GPUs, which is discussed next.
We further perform model-scale tests with 4 and 16 GPUs in a single DGX-2 node. As shown, the maximum trainable model size stays the same for both PyTorch and L2L, because neither handles the memory redundancies of data parallelism; as a result, their scalability is bounded by the model scale on a single GPU. Both Megatron and ZeRO-2 support larger model training with more GPUs, but they cannot scale efficiently beyond 15B parameters, even with 16 GPUs. Megatron supports larger models than ZeRO-2, because ZeRO-2 still incurs memory redundancies on the model weights. On the other hand, ZeRO-Offload easily enables training of models with up to 70B parameters by partitioning and offloading optimizer states and gradients to CPU memory, combined with model parallelism. Overall, ZeRO-Offload increases the model scale on a single DGX-2 node by 50x, 4.5x, 7.8x, and 4.2x over PyTorch, Megatron, ZeRO-2, and L2L, respectively.
Next, we compare the training throughput of ZeRO-Offload and L2L for models with billion-scale parameters on a single GPU. We do not include Megatron and ZeRO-2 in this comparison, because neither can train models larger than 1.4B parameters without running out of memory. We evaluate ZeRO-Offload and L2L with the same training batch size (e.g., 512) and the same micro-batch sizes (shown in Table 3), with gradient accumulation enabled. We also disable delayed parameter update in this experiment so that the comparison reflects only system efficiency; the performance improvement from delayed parameter update and its impact on convergence are evaluated in Section 6.2.4. Figure 9 shows that ZeRO-Offload outperforms L2L by 14% on average (up to 22%) in throughput (TFLOPS). The performance benefit of ZeRO-Offload comes from two aspects. First, ZeRO-Offload has a lower communication cost between CPU and GPU than L2L. For a model with M parameters, L2L requires a communication volume of 28M between GPU and CPU, which is the sum of the weights, gradients, and optimizer states of each layer of the model. As analyzed in Sec. 4.1, the communication volume between CPU and GPU memory in ZeRO-Offload is 4M, 7x smaller than that of L2L. The reduced communication volume significantly mitigates the bottleneck from CPU-GPU communication. Second, compared with L2L, the parameter update of ZeRO-Offload happens on the CPU instead of the GPU, but our optimized CPU-Adam implementation achieves parameter update performance quite comparable to that of the PyTorch Adam implementation on GPU (evaluated in Sec. 6.2.4). Therefore, although the optimizer update on the GPU in L2L is slightly faster than the optimizer update on the CPU in ZeRO-Offload, the communication overhead introduced by L2L leads to overall lower throughput than ZeRO-Offload. Multi-GPU in a single DGX-2.
Next, we compare the training throughput of PyTorch, ZeRO-2, Megatron, ZeRO-Offload without model parallelism (w/o MP), and ZeRO-Offload with model parallelism (w/ MP) on one DGX-2 node. When using MP, we use the MP degree that gives the best performance for both the baseline and ZeRO-Offload. We use a total batch size of 512 for all experiments, using a combination of micro-batch size per GPU and gradient accumulation. To get the best performance for each configuration, we use the largest micro-batch size that can be supported without running out of memory. We exclude L2L [layertolayerPyTorch] from this test because its implementation does not support multi-GPU training. Figure 10 shows the per-GPU throughput results when training on multiple GPUs. We make the following observations:
For 1B to 15B models, ZeRO-Offload achieves the highest throughput, up to 1.33x, 1.11x, and 1.64x higher than PyTorch, ZeRO-2, and Megatron, respectively. By offloading all the optimizer states to the CPU with low overhead, ZeRO-Offload can train with larger micro-batch sizes, giving higher throughput.
ZeRO-2 runs out of memory once the model size exceeds 8B, due to the lack of enough aggregate GPU memory to store the model states on 16 GPUs. In contrast, ZeRO-Offload scales to 13B without model parallelism, because it offloads the optimizer states and the majority of the gradients to CPU memory.
When combined with model parallelism, ZeRO-Offload enables training of up to 70B-parameter models with more than 30 TFLOPS of throughput per GPU. In contrast, Megatron supports only up to 15B-parameter models before running out of memory when using model parallelism alone.
ZeRO-Offload outperforms ZeRO-2 and Megatron in throughput for 1–8B and 1–13B parameter models, respectively. ZeRO-Offload is faster than Megatron because it eliminates frequent communication between different GPUs and can train with larger micro-batch sizes; it outperforms ZeRO-2 also due to larger micro-batch sizes.
We compare the throughput scalability of ZeRO-2 and ZeRO-Offload on up to 128 GPUs in Figure 11. (We do not include a comparison against Megatron because it consistently performs worse than ZeRO-Offload, as shown in Figure 10; given the communication overhead added by model parallelism, scaling out Megatron training cannot achieve higher throughput than ZeRO-Offload even with linear scalability.) We make the following key observations. First, ZeRO-Offload achieves near-perfect linear speedup in aggregate throughput (green line), running at over 30 TFLOPS per GPU (blue bars). Second, from 1 to 16 GPUs, while ZeRO-2 runs out of memory, ZeRO-Offload can effectively train the model, turning the training from infeasible to feasible. Third, with 32 GPUs, ZeRO-Offload slightly outperforms ZeRO-2 in throughput. The improvement comes from the additional GPU memory savings of ZeRO-Offload, which allow training with larger batch sizes and thus increased GPU computation efficiency. Fourth, with more GPUs (64 and 128), ZeRO-2 starts to outperform ZeRO-Offload, because both can now run similar batch sizes, achieving similar computation efficiency, whereas ZeRO-2 does not suffer the additional overhead of CPU-GPU communication. In summary, ZeRO-Offload complements ZeRO-2, enabling large model training from a single device to thousands of devices with good computation efficiency.
In this part, we evaluate our Adam implementation against the PyTorch Adam implementation on CPU. Table 4 shows the optimizer execution time of the two implementations for models with 1 to 10 billion parameters. Compared to PyTorch (PT-CPU), CPU-Adam reduces the execution time by over 5x for all configurations, and by 6.4x for the 1B-parameter case. CPU-Adam achieves these speedups by exploiting instruction-level parallelism, thread-level parallelism, and the tile-based data copy scheme (shown in line 15 of Algorithm 2). Meanwhile, although CPU-Adam is slower than the PyTorch Adam implementation on GPU (PT-GPU), the performance gap is modest, and the CPU computation does not become a bottleneck for training throughput.
Figure 9 shows the comparison of the training throughput of GPT-2 with and without DPU. As shown, with DPU enabled, training achieves 1.12–1.59x higher throughput than without it, across a wide range of model sizes, with a small micro-batch size of 8. This is expected, because DPU allows the optimizer updates to overlap with the next forward computation, so that the GPU is not slowed down by the CPU computation and CPU-GPU communication. But what about accuracy? Convergence impact. We study the convergence impact of DPU on both GPT-2 and BERT. Figure 13 shows the pretraining loss curves of GPT-2 over 100K training iterations and the loss curves of fine-tuning the BERT-large model on SQuAD, comparing unmodified (PyTorch) training, ZeRO-Offload without DPU, and ZeRO-Offload with DPU. In both cases, DPU is enabled after 40 iterations, allowing the training to stabilize in its early stage before DPU is introduced. We observe that the training curves of the unmodified GPT-2 and ZeRO-Offload w/o DPU overlap exactly, because ZeRO-Offload w/o DPU performs only system optimizations and does not alter the training dynamics. On the other hand, the training curve of ZeRO-Offload with DPU converges slightly slower at the very beginning of training (barely visible at 2K–5K iterations) and quickly catches up after 5K iterations. For the remainder of the training, the training loss matches the original training until the model converges. For BERT-large fine-tuning, although the training losses are not exactly the same, they converge with the same trend and largely overlap. Without changing any hyperparameters, ZeRO-Offload + DPU achieves the same final F1 score (92.8) as the baseline.
# params   | CPU-Adam (s) | PT-CPU (s) | PT-GPU (s, L2L)
1 billion  | 0.22         | 1.39       | 0.10
2 billion  | 0.51         | 2.75       | 0.26
4 billion  | 1.03         | 5.71       | 0.64
8 billion  | 2.41         | 11.93      | 0.87
10 billion | 2.57         | 14.76      | 1.00
From these results on both GPT-2 pretraining and BERT-large fine-tuning, we empirically verify that DPU is an effective technique for improving the training throughput of ZeRO-Offload without hurting model convergence or accuracy. The one-step staleness introduced by DPU is well tolerated by the iterative training process once the model has passed its initial training phase.
We presented ZeRO-Offload, a powerful GPU-CPU hybrid DL training technology with high compute efficiency and near-linear throughput scalability, which allows data scientists to train multi-billion parameter models even on a single GPU, without requiring any model refactoring. We open-sourced ZeRO-Offload as part of the DeepSpeed library (www.deepspeed.ai) with the hope of democratizing large model training, allowing data scientists everywhere to harness the potential of truly massive DL models.