On the Utility of Gradient Compression in Distributed Training Systems

02/28/2021 · Saurabh Agarwal et al.

Rapid growth in data sets and the scale of neural network architectures have rendered distributed training a necessity. A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, the machine learning community has largely focused on developing gradient and model compression methods. In parallel, the systems community has adopted several High Performance Computing (HPC) techniques to speed up distributed training. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD. Surprisingly, we observe that due to computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide a meaningful end-to-end speedup.


1 Introduction

Synchronous data parallel training using stochastic gradient descent (SGD) is one of the most widely adopted approaches for distributed learning [li2020pytorch, narayanan2019pipedream, dean2012comm]. A single iteration of distributed data parallel SGD comprises two main phases: gradient computation and gradient aggregation. During the computation phase the model gradient is computed and, following that, during the aggregation phase, gradients are synchronously averaged among all participating nodes [iandola2016firecaffe, goyal2017accurate]. During this second phase, millions of parameters are communicated among nodes; this has been shown to lead to communication bottlenecks [dean2012large, seide20141, qi17paleo, grubic2018synchronous, alistarh2017qsgd]. Alleviating communication bottlenecks in distributed training has been of interest to both the systems and machine learning communities, and is an active area of research.

The systems community has proposed several ways of alleviating communication bottlenecks by: (i) optimizing communication scheduling [abadi2016tensorflow, peng2019generic], (ii) overlapping communication and computation to enable higher utilization [narayanan2019pipedream, li2020pytorch], and (iii) optimizing operator placement [jia2018beyond, jia2018exploring], e.g., to switch between model parallelism and data parallelism. In addition, the systems community has borrowed ideas from the HPC field [sergeev2018horovod, li2020pytorch], and has integrated techniques such as ring- and tree-reduce [sanders2009two]. Both of these all-reduce techniques are bandwidth efficient and have a near constant dependence on the number of nodes. The above techniques have been implemented in several high performance communication libraries (e.g., NCCL, Gloo) which are tightly integrated into popular deep learning frameworks, e.g., PyTorch [paszke2019pytorch, li2020pytorch] and Tensorflow [abadi2016tensorflow].

Parallel to the systems research on the topic, the ML community has predominantly focused on lossy gradient/model compression methods to mitigate communication costs. This includes techniques like low precision training [seide20141, alistarh2017qsgd, bernstein2018signsgd, wen2017terngrad, acharya2019distributed], compression by only transmitting gradients of largest magnitude [aji2017sparse, acharya2019distributed, lin2017deep], and using low-rank updates [wang2018atomo, vogels2019powersgd]. Although these methods require significant effort to integrate in ML frameworks and often introduce extra hyper-parameters, they promise an impressive amount of reduction in communication; e.g., PowerSGD [vogels2019powersgd] claims to reduce communication by well over an order of magnitude with minimal effect on accuracy.

While methods developed by the systems community have been evaluated on optimized implementations of synchronous SGD [narayanan2019pipedream], we observe that previous experimental studies in gradient compression literature [vogels2019powersgd, bernstein2018signsgd] have taken little notice of the plethora of new system-level optimizations in distributed training. These system-level innovations have—orthogonally to gradient compression—been mitigating the same communication bottlenecks.

In this work we consider three representative gradient compression techniques: signSGD [bernstein2018signsgd], PowerSGD [vogels2019powersgd], and Top-k sparsification [aji2017sparse]. We empirically evaluate their performance and compare against optimized implementations of data-parallel training. Our goal is to identify the main factors that dictate the performance of different compression schemes and how they perform under different hardware configurations.

In addition to evaluating these methods on existing hardware, we aim to understand how changes in compute or network availability will affect distributed training and gradient compression methods, i.e., to answer questions like: if bandwidth increases by some factor, or compute becomes faster, how much faster will distributed training be with or without gradient compression?

To address this, we develop a performance model for data parallel synchronous SGD that takes into account several factors, e.g., network bandwidth, batch size, communication collectives, etc. We also account for system-level optimizations present in state-of-the-art distributed training frameworks [li2020pytorch, abadi2016tensorflow], e.g., gradient bucketing, communication overlap, etc. We also extend our performance model to various gradient compression schemes and verify that our performance model provides good estimates of time per iteration across a number of models (ResNet-101, ResNet-50, BERT, etc.). Based on our experiments and performance model we find:

  1. There is no utility in overcompressing gradients: Most of the prior work in gradient compression assumes that the higher the compression ratio, the more efficient the method is. Although this is true when purely focusing on the cost of communication, we observe that in the data center setting (e.g., bandwidth of 10Gbps) only a modest reduction in the size of the original gradients suffices. Often this can be achieved simply by communicating at half precision.

  2. Increasing batch size decreases the utility of gradient compression:

    Optimized implementations of synchronous SGD are able to overlap the computation and communication phases. A longer computation phase associated with larger batches can be used to “hide” the time consumed by the communication phase. Additionally, when training for a fixed number of epochs, larger batches lead to less frequent communication per epoch. We observe that when gradient compression methods are used with large batch training they often lead to higher per iteration time than “vanilla” synchronous SGD.

  3. Compression techniques that are not all-reducible do not scale well: Several gradient compression methods are not compatible with all-reduce as their aggregation methods are not associative. We observe that compression techniques that are not all-reduce compatible suffer from massive slowdown at a large scale since their communication increases linearly with the number of machines.

    Meanwhile, methods that are all-reduce compatible show much better scalability since their communication requirements remain constant. For example, signSGD, a popular compression method that is not all-reduce compatible, takes considerably longer for gradient computation and synchronization when using 96 GPUs for ResNet-101 than the baseline (SGD without compression), which uses all-reduce.

  4. Back-propagation and gradient compression compete for computational resources: Gradient computation and gradient compression are both compute intensive and end up competing for resources on the GPU leading to slowdown when they are overlapped. (Section 3.1).

  5. For most settings there is limited opportunity for gradient compression to provide speedup: We observe that the difference between perfect scaling and an optimized implementation of synchronous SGD is less than 200 milliseconds when operating at typical bandwidths available in data centers, even for communication heavy models like BERT [devlin2018bert]. To provide actual speedups, gradient compression methods need to perform encode-decode and communication within this limited time-frame. On the other hand, we find existing gradient compression methods have large encode-decode times (upwards of 50 milliseconds). Further, as the bandwidth available increases (e.g., 25Gbps or more) the time-frame available for compression decreases.

We note that the above results are derived from analyzing the per-iteration time and do not account for any loss in accuracy incurred by gradient compression. In that sense our models are generous to these techniques, as many tend to come with some accuracy loss that can only be mitigated with more iterations or additional computation (e.g., error feedback [karimireddy2019error, stich2018sparsified]). Thus, in summary, our analysis establishes that, as it currently stands, there is marginal value in gradient compression for popular models once we account for system level optimizations for synchronous SGD and common datacenter hardware.

We conclude our work by describing how our performance model can aid machine learning researchers in developing better gradient compression schemes and how data scientists can perform what-if analysis to pick appropriate gradient compression schemes to obtain end-to-end speedups.

2 Background

With increasing data and model sizes, training neural networks on a single machine, even with multiple GPUs, becomes a bottleneck, making scaling beyond a single machine a necessity. However, scaling beyond a single machine leads to poor speedup due to the communication required to synchronize across machines every iteration. The results in DawnBench [coleman2019analysis] indicate that more than 80% of the training time is spent in communication. Thus, both the systems and machine learning communities have focused on alleviating this bottleneck. The systems community has focused on novel ways to overlap communication with computation. PipeDream [narayanan2019pipedream] focuses on increasing training throughput by developing a novel pipeline parallelism scheme that allows multiple batches to proceed simultaneously. Further, TicTac [hashemi2018tictac] proposed an ordering on communication to provide optimal speedups and BytePS [jiang2020unified] developed a unified communication architecture for large scale DNN training. Another interesting direction from the systems community has been intelligent operator placement to run a hybrid between model and data parallel training [jia2018beyond, jia2018exploring]. On the other hand, the machine learning community has focused primarily on reducing communication by (i) minimizing the amount of data transferred using gradient compression, and (ii) minimizing the frequency of communication using larger batch sizes [you2017scaling, yao2018hessian, yao2018large, smith2017don, goyal2017accurate, devarakonda2017adabatch]. In this paper, we focus on gradient compression schemes and understanding their performance at scale.

2.1 Gradient Compression

Typically DNN’s are trained using some variant of stochastic gradient descent [robbins1951stochastic, duchi2011adaptive, kingma2014adam] (SGD). Inspired by the observation that SGD can make progress even with approximate gradients, several gradient compression methods have been proposed. Broadly these methods can be grouped into quantization, sparsification, and low rank approximations.

Quantization

Quantization based methods reduce the number of bits used to represent each element of the gradient vector. There exist several methods that have studied or proposed new ways of performing quantization [alistarh2017qsgd, bernstein2018signsgd, karimireddy2019error, dettmers20158, seide20141, wen2017terngrad, bernstein2018signsgdtol, yu2019exploring, li2018network, horvath2019natural, tang2019doublesqueeze, dryden2016communication, strom2015scalable, shuai2019comm, zhang2017zipml]. One of the first methods proposed was 1-bit SGD [seide20141], where all gradient elements less than a user-defined threshold are quantized to 0 and gradient elements greater than the threshold are quantized to 1. A recent efficient method to perform quantization is signSGD [bernstein2018signsgd, bernstein2018signsgdtol]. With signSGD all negative values are mapped to -1 and all positive values are mapped to 1. The gradient aggregation is calculated by a majority vote, i.e., if for a particular coordinate the gradient values are -0.5, -0.1, -1.7, 2, the gradient update applied after quantization will be -1. For $p$ workers this can be concisely written as $\mathrm{sign}\big(\sum_{i=1}^{p} \mathrm{sign}(g_i)\big)$, where $g_i$ is the gradient vector of worker $i$ and the $\mathrm{sign}$ operator maps each coordinate to either 1 or -1. We observe that this is one of the fastest methods in terms of encode-decode time. Therefore we choose this method as a representative of quantization-based methods. We illustrate the operation of signSGD in Figure 1.
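To make the encode and aggregate steps concrete, the following is a minimal NumPy sketch of sign quantization with majority-vote aggregation. It illustrates the scheme described above rather than the authors' implementation; the function names and the int8 packing are our own choices.

```python
import numpy as np

def sign_encode(grad):
    """signSGD encoding: keep only the sign of each coordinate (1 bit of information)."""
    return np.where(grad >= 0, 1, -1).astype(np.int8)

def majority_vote(encoded_grads):
    """Aggregate per-worker sign vectors by majority vote: the sign of the coordinate-wise
    sum. Ties (possible with an even number of workers) resolve to +1 here."""
    votes = np.sum(np.stack(encoded_grads).astype(np.int32), axis=0)
    return np.where(votes >= 0, 1.0, -1.0)

# Four workers; the first coordinate mirrors the example in the text: -0.5, -0.1, -1.7, 2
grads = [np.array([-0.5,  0.3]),
         np.array([-0.1, -1.2]),
         np.array([-1.7,  0.7]),
         np.array([ 2.0,  0.9])]
print(majority_vote([sign_encode(g) for g in grads]))  # [-1.  1.]
```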

Sparsification

Sparsification based methods perform operations to generate an extremely sparse vector and then communicate only the non-zero values and their indices. One of the most popular methods for sparsification is Top-k [aji2017sparse], where only the values at the top k% of indices by magnitude are synchronized among participating workers, e.g., see Figure 1. Another method is Deep Gradient Compression [lin2017deep], which only communicates gradient coordinates whose absolute value is larger than a threshold. On the other hand, recent work tracks the variance of each coordinate and only communicates the gradient coordinates which have a variance less than a specified threshold [wangni2018gradient, tsuzuku2018variance]. Since Top-k has been widely studied [shi2019understanding, lin2017deep], we choose it as part of our evaluation.
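As a reference point, here is a small NumPy sketch of Top-k compression and the gather-then-densify aggregation it implies; the helper names and the dense-averaging step are illustrative assumptions, not the exact procedure of any particular system.

```python
import numpy as np

def topk_encode(grad, k_fraction=0.01):
    """Keep only the k% largest-magnitude coordinates; transmit (values, indices)."""
    k = max(1, int(k_fraction * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the k largest |g_i|
    return grad[idx], idx

def topk_decode(values, indices, size):
    """Scatter the transmitted values back into a dense zero vector."""
    dense = np.zeros(size)
    dense[indices] = values
    return dense

# Because the selected indices differ across workers, summation is not associative over
# (values, indices) pairs: every worker's pair must be gathered, densified, then averaged.
workers = [np.random.randn(10_000) for _ in range(4)]
pairs = [topk_encode(g, 0.01) for g in workers]
aggregated = np.mean([topk_decode(v, i, 10_000) for v, i in pairs], axis=0)
```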

Low-rank Factorization

Recent work has shown that a gradient matrix can be compressed into two rank-r matrices, P and Q. For convolutional networks the 4D gradient tensors are reshaped to 2D matrices. For ResNet [he2016deep], the dimensions n and m of a layer's gradient matrix are usually around 512 and 4068, and there are 50 such layers in ResNet-50. Typically r is chosen to be much smaller than n and m; this reduces the communication from n·m entries to r·(n+m) entries. Common values of r range between 4 and 16. A higher value of r provides lower compression but better accuracy [vogels2019powersgd]. We can recover an approximation of the gradient matrix by multiplying P and Q, and if the original gradient was a 4D tensor, it is further reshaped to the original dimensions. Several methods [wang2018atomo, vogels2019powersgd, cho2019gradzip, yu2018gradiveq] have been proposed in the literature to decompose the gradient matrix into P and Q. ATOMO [wang2018atomo] performs singular value decomposition (SVD) to calculate the P and Q matrices, but the SVD on each gradient matrix is compute intensive. On the other hand, PowerSGD [vogels2019powersgd] uses power iteration to approximately decompose the gradient matrix into P and Q. Among low rank approximation methods we choose PowerSGD since it is one of the best performing low rank compression schemes [vogels2019powersgd].
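The following NumPy sketch shows a single rank-r power-iteration compression step of the kind PowerSGD builds on, simplified by omitting error feedback, warm starts, and the all-reduce of P and Q; it is our illustrative reading, not the reference implementation.

```python
import numpy as np

def power_compress(grad_matrix, rank=4, rng=None):
    """Factor an (n x m) gradient matrix into P (n x r) and Q (m x r) with one power iteration."""
    rng = rng or np.random.default_rng(0)
    n, m = grad_matrix.shape
    q = rng.standard_normal((m, rank))
    p = grad_matrix @ q                 # P = G Q
    p, _ = np.linalg.qr(p)              # orthogonalize P
    q = grad_matrix.T @ p               # Q = G^T P
    return p, q                         # workers communicate P and Q instead of G

def power_decompress(p, q):
    """Low-rank reconstruction of the gradient."""
    return p @ q.T

g = np.random.randn(512, 4068)          # e.g., a reshaped convolutional-layer gradient
p, q = power_compress(g, rank=4)
g_hat = power_decompress(p, q)
# Transmitted: rank * (n + m) values instead of n * m values.
print(p.size + q.size, "vs", g.size)
```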

Figure 1: Illustration of sparsification (Top-k), quantization (signSGD), and low-rank (ATOMO) techniques.

2.2 System Advances

Next, we provide a brief overview of several system advances which have significantly improved the performance of distributed training.

All Reduce

In a data parallel setting the gradients need to be aggregated among all workers. In recent years, a number of systems have shifted from using a parameter server based topology to an all-reduce topology. For example, we observe that all submissions to DawnBench [coleman2019analysis] for which source code is available use all-reduce for performing distributed training.

To model communication costs we use the model of [sarvotham2001connection], where the cost of sending/receiving a vector of size n is modeled as α + n/B, where α is the latency term and B is the bandwidth term. There are several optimizations for all-reduce based collectives [rabenseifner2004optimization, thakur2005optimization, hoefler2011performance, sanders2009two]. These optimizations can be thought of as design choices trading off the latency and bandwidth terms.

Figure 2: Overlap of Gradient Communication with Computation: The figure shows a single backward pass. We observe that communication proceeds in a separate CUDA stream. It is only for the last bucket that the computation needs to wait.

NVIDIA NCCL [nccl], a communication library from NVIDIA, has support for double-tree [sanders2009two] and ring reduce [barnett1994interprocessor]. Both double-tree and ring-reduce send and receive roughly 2n bytes per machine for an n-byte gradient. The latency term for double-tree grows as O(log p) in the number of machines p, while the latency term for ring-reduce grows as O(p). Due to reduced latency, double-tree reduce performs better at large scale [nccl]. However, double-tree reduce requires the whole message to be broken down into multiple blocks, which has been shown to have high overhead at small scale [nccl]. High performance implementations like NCCL choose which algorithm to use dynamically based on several factors like the number of machines, bandwidth, interconnect, and communication size. In this work, for simplicity, we analyze our results with the communication model of ring-reduce.

For an operation to be compatible with all-reduce it must be associative, i.e., the order of operations shouldn't matter. However, Table 1 shows that several gradient compression methods are not compatible with all-reduce. In these cases, to perform gradient aggregation, the workers need to perform an all-gather operation. This can lead to communicating on the order of n·p bytes of data, resulting in poor scalability as we increase the number of workers.
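The contrast between the two aggregation patterns can be written down directly from the α + n/B cost model above. The sketch below follows that model; the constants in the example (latency, bandwidth, payload size) are illustrative values, not measurements.

```python
def ring_allreduce_time(n_bytes, p, alpha, bandwidth):
    """Ring all-reduce: per-machine traffic stays ~2n bytes, latency grows with p."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) * n_bytes / (p * bandwidth)

def allgather_time(n_bytes, p, alpha, bandwidth):
    """All-gather (needed when aggregation is not associative): each worker
    receives (p - 1) * n bytes, so traffic grows linearly with p."""
    return (p - 1) * alpha + (p - 1) * n_bytes / bandwidth

# 100 MB payload, 10 Gbps (~1.25 GB/s), 5 microsecond latency term
for p in (8, 32, 96):
    print(p, ring_allreduce_time(100e6, p, 5e-6, 1.25e9),
             allgather_time(100e6, p, 5e-6, 1.25e9))
```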

Compression Method All-reduce Layer-Wise Compression
syncSGD
GradiVeq [yu2018gradiveq]
PowerSGD [vogels2019powersgd]
Random- [wangni2018gradient]
ATOMO [wang2018atomo]
signSGD [bernstein2018signsgd]
TernGrad [wen2017terngrad]
QSGD [alistarh2017qsgd]
DGC [lin2017deep]
Table 1: Classifying different methods based on their compatibility with the all-reduce operations.

Communication and computation overlap

Another system level optimization which several state of the art distributed data parallel training frameworks [sergeev2018horovod, li2020pytorch] utilize is overlapping gradient computation with gradient communication. Previous works like TicTac [hashemi2018tictac] and ByteScheduler [peng2019generic] propose methods for optimal scheduling of communication. In popular deep learning frameworks [li2020pytorch, paszke2019pytorch, abadi2016tensorflow] when performing distributed training, gradient communication starts immediately when the gradients of a particular layer are available. This provides a significant advantage as oftentimes the backward pass is time consuming, and overlapping communication with backward time can help in hiding the cost of communication. Figure 2 shows an illustration of a single backward pass. We trace the training process using NVIDIA Nsight and can see how gradient communication runs in parallel with computation. We observe that if the time for the backward pass is large compared to communication time then there is very little slowdown because of communication.

Bucketing Gradients

Prior work [li2020pytorch] has also shown that naively sending gradients as soon as they are available, i.e., issuing a per-layer all-reduce call, can lead to large overheads. Therefore they propose bucketing of gradients, where buckets of a fixed size are created and, once the gradients for a bucket have been calculated, all-reduce is called on the entire bucket. As bucket sizes are typically large (25 MB in PyTorch), gradient bucketing amortizes the cost of invoking all-reduce. Gradient bucketing is used by popular ML frameworks including DistributedDataParallel [li2020pytorch] in PyTorch and Tensorflow [tensorflow2020dist].
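A minimal sketch of this bucketing idea is shown below; the greedy grouping and the 25 MB cap mirror the description above, but the function is illustrative and not PyTorch's actual bucketing logic.

```python
def make_buckets(layer_sizes_bytes, bucket_cap=25 * 1024 * 1024):
    """Greedily group per-layer gradients (in reverse order, the order in which they
    become ready during the backward pass) into buckets of at most ~bucket_cap bytes."""
    buckets, current, current_size = [], [], 0
    for layer, size in reversed(list(enumerate(layer_sizes_bytes))):
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(layer)
        current_size += size
    if current:
        buckets.append(current)
    return buckets  # all-reduce is invoked once per bucket instead of once per layer

# 60 layers of 4 MB each -> 10 buckets of 6 layers, i.e., 10 all-reduce calls instead of 60
print(len(make_buckets([4 * 1024 * 1024] * 60)))
```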

3 Analysis of Gradient Compression Schemes

Figure 3: Effect of Overlapping Gradient Compression with Computation: When gradient compression is overlapped with computation it ends up requiring more time per iteration than performing it sequentially. We believe this is due to contention for compute resources.

In this section we first study the scalability of existing gradient compression schemes. We start by analyzing the effects of overlapping gradient compression and gradient computation (Section 3.1). Next we run large scale experiments (up to 96 GPUs), comparing the scalability of distributed implementations of syncSGD (SGD without compression) with the gradient compression methods described in Section 2.

(a) ResNet 50: Batch Size 64
(b) ResNet 101: Batch Size 64
(c) BERT: Batch Size 12
Figure 4: Scalability of PowerSGD: When compared against an optimized implementation of syncSGD, PowerSGD provides little speedup. Except for the case of BERT, where there are some wins, PowerSGD has a high per iteration time.

3.1 Overlapping Compression, Computation

To analyse the consequences of overlapping gradient compression with gradient computation we integrate gradient compression methods to run in parallel with computation.

We observe that when gradient compression is performed in parallel with the backward computation it is slower than performing gradient compression after calculating the backward pass. Figure 3 depicts this phenomenon for PowerSGD compressing to Rank-4, Top-k with k=1%, and signSGD. We attribute this to the fact that both gradient compression and gradient computation are compute-heavy steps, and when gradient compression is performed in parallel with computation they end up competing for resources on the GPU, leading to an overall slowdown. On the other hand, synchronous SGD only performs the all-reduce operation, which is communication heavy with very little compute involved, leading to little contention for compute resources. Therefore for the rest of our experiments we perform gradient compression after gradient computation.

In summary we find:

Takeaway: Gradient Compression methods are poor candidates for overlap with gradient computations since both gradient compression and gradient computation are compute heavy processes leading to an overall slowdown.

3.2 Comparing compression schemes

(a) ResNet50: Batch Size 64
(b) ResNet 101: Batch Size 64
(c) BERT: Batch Size 12
Figure 5: Scalability of Top-k: We compare the time taken for gradient computation and aggregation for Top-k with syncSGD. Across all three models we observe that at large scale, due to the lack of support for all-reduce and high encode-decode time, Top-k performs considerably slower than syncSGD. Note: For BERT we could not scale Top-k beyond 32 GPUs, because the memory requirement of Top-k increases linearly with the number of machines and for BERT we ran out of available memory.
(a) ResNet50: Batch Size 64
(b) ResNet 101: Batch Size 64
(c) BERT: Batch Size 12
Figure 6: Scalability of SignSGD: We compare the time taken for gradient computation and aggregation for signSGD with syncSGD. Across all three models we observe that at large scale, due to the lack of support for all-reduce and signSGD's linearly increasing decode time, it performs considerably slower than syncSGD. Note: For BERT we could not scale signSGD beyond 32 GPUs, because the memory requirement of signSGD increases linearly with the number of machines and for BERT we ran out of available memory.

We next analyse the performance of three popular gradient compression methods, PowerSGD, Top-k, and signSGD, for varying numbers of machines. For all our experiments we used p3.8xlarge instances on AWS. Each instance has 4 GPUs and provides around 10Gbps of bandwidth. We scale our experiments to 96 GPUs (24 p3.8xlarge instances). We consider weak scaling, i.e., the number of inputs per worker is kept constant as we increase the number of workers. This is the most commonly used scenario for evaluating the scalability of deep learning training [coleman2019analysis, narayanan2019pipedream]. Thus, when we refer to a particular batch size used for training, this is the batch size used at each worker.

We use ResNet-50, ResNet-101 and BERT as the models to study, given their contrasting properties. ResNet-50 is an extremely popular, compact (97MB) yet computationally heavy model for its size. ResNet-101 is also computationally heavy but is larger (170 MB). BERT is an extremely popular language model which is also quite communication heavy (418 MB). For all our timing measurements on vision models we used the ImageNet dataset [deng2009imagenet] and for BERT we used the Sogou News dataset [sun2019fine]. For the timing measurements we run 110 iterations for each setup and discard the first 10. For the remaining hundred we take the average. For error bars, we use the standard deviation.
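The measurement protocol above can be summarized in a few lines; the sketch below assumes a caller-supplied `run_iteration` callable and, on GPUs, that the caller synchronizes the device inside it so the timer measures real work.

```python
import time
import statistics

def time_per_iteration(run_iteration, warmup=10, measured=100):
    """Run warmup + measured iterations, discard the warmup, return (mean, std) in seconds.
    `run_iteration` should perform one forward/backward/aggregation step and, on GPUs,
    synchronize the device before returning."""
    samples = []
    for _ in range(warmup + measured):
        start = time.perf_counter()
        run_iteration()
        samples.append(time.perf_counter() - start)
    samples = samples[warmup:]
    return statistics.mean(samples), statistics.stdev(samples)
```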

PowerSGD

PowerSGD achieves a very high compression ratio (well over an order of magnitude) when using Rank-4 for ResNet-50.

We first study the scalability of PowerSGD when compared to synchronous SGD for ResNet-50, ResNet-101 and BERT. We use Rank-4, 8 and 16 as these are shown to achieve good accuracy in the experimental study by the authors of PowerSGD [vogels2019powersgd]. As shown in Figure 4, PowerSGD with Rank 4, 8, and 16 is slower than synchronous SGD for ResNet-50 and ResNet-101 with batch size 64 (we investigate varying batch sizes in Section 3.3). This is primarily because synchronous SGD does not incur any overheads from compression and is able to overlap communication with computation. On the other hand, for BERT, which is a much larger model, we see that at 96 GPUs, Rank-4 and Rank-8 are faster than synchronous SGD by around 23.1% and 13.9% respectively, while Rank-16 takes longer than syncSGD.

TopK

Sending the top k% of gradient coordinates by absolute value is one of the most popular sparsification methods. However, we observe that high encode-decode time and the lack of all-reducibility affect the scalability of Top-k.

Results comparing Top-k to synchronous SGD are shown in Figure 5. We see that across all three models, even when using k = 1%, i.e., when 99% of the entries in the gradient are dropped, there are no performance gains when compared to synchronous SGD. This is primarily because of the extremely high encode time and the incompatibility with all-reduce.

(a) Resnet101: Batch Size 16
(b) Resnet101: Batch Size 32
(c) Resnet101: Batch Size 64
Figure 7: Effect of varying batch size: We observe that large batch sizes provide enough opportunity for syncSGD to hide the communication time, while at small batch sizes, due to the reduced computation time, this overlap is not possible. This makes gradient compression methods more useful at small batch sizes. Here we compare PowerSGD against syncSGD on ResNet-101, since it is one of the more communication heavy models we study.

SignSGD

We next study SignSGD with majority vote, where we only send 1 bit for each 32-bit value, leading to around 32x gradient compression.

Results comparing synchronous SGD to SignSGD are shown in Figure 6. We see that, although SignSGD is extremely quick to encode and decode, due to the lack of all-reducibility the communication time scales linearly with the number of workers. Moreover, SignSGD at best can provide 32x compression, which is not enough to offset the lack of all-reducibility.

Takeaway: Existing gradient compression methods provide limited benefits either due to encoding overheads or due to lack of all-reducibility across a range of models.

3.3 Varying batch size

We next consider PowerSGD, the best performing compression scheme among the ones in the previous section, and study how changing the batch size affects scalability. We use ResNet-101 and analyse the effect of varying batch sizes. In Figure 7, we find that the benefit of using PowerSGD with Rank-4 drops as we increase the batch size. For instance, while PowerSGD Rank-4 provides almost 40% speedup when using a batch size of 16, the speedup drops to 20% for batch size 32. In fact, with batch size 64, we find that PowerSGD Rank-4 is around 10% slower than synchronous SGD. We observed a similar trend with other models too; e.g., when training BERT with 64 machines and a smaller batch size, PowerSGD Rank-4 provides 24% speedup, but when we increase the batch size to 12, the speedup drops to 18%. In general, increasing the batch size leads to greater opportunities for synchronous SGD to overlap computation and communication.

Takeaway: Using large batch sizes often provides enough opportunity for synchronous SGD to overlap communication with computation, thereby reducing the extent of the benefits achieved from using gradient compression.

4 Performance Model

In the previous section we observed that the speedups provided by gradient compression algorithms are quite limited in the standard data-center setting. We next investigate two aspects: first, why are gradient compression methods not able to provide significant speedups, and under what setups should we expect to see benefits from the reduced communication due to gradient compression; second, how will changes in hardware and network interconnects affect these speedups. Unfortunately the choices for hardware and network bandwidth are quite limited, making it infeasible to perform such analysis using real-world experiments; i.e., it becomes impossible to perform what-if analyses to study how the performance would be affected under 100Gbps bandwidth or with a faster GPU.

Therefore, similar to prior work in understanding performance [ousterhout2015making, zhang2020network], we first introduce a performance model which allows us to get good estimates of the wall clock time when gradient compression methods are used with system advances. The performance model acts as a simulator allowing us to perform several what-if analyses to understand the utility of gradient compression algorithms. Moreover the model helps us to reason about real-world observations and provides a framework for data scientists and engineers to reason about the expected performance of a gradient compression scheme.

4.1 Distributed Data Parallel

First we construct a performance model for PyTorch DistributedDataParallel [li2020pytorch] and Tensorflow Distributed [tensorflow2020dist], since these are the two most popular deep learning frameworks.

Here we assume that a model can be partitioned into $n_b$ buckets, where the first $n_b - 1$ buckets are of size $s_b$ and the last bucket is of size $s_l$, with $s_l \le s_b$. Because of the optimizations listed in Section 2.2, the total time observed for the backward pass and gradient synchronization for synchronous SGD becomes:

$$T_{total} = \max\big(\kappa \cdot T_{comp},\; (n_b - 1)\, T_{AR}(s_b, p, B)\big) + T_{AR}(s_l, p, B)$$

where $T_{total}$ is the total time observed for gradient calculation and synchronization, $T_{comp}$ is the compute time required to perform the backward pass, $T_{AR}(s_b, p, B)$ is the time required to communicate a gradient bucket of size $s_b$ across $p$ GPUs at bandwidth $B$, and $T_{AR}(s_l, p, B)$ is the time to communicate the last bucket of size $s_l$, which cannot be overlapped with computation. $\kappa$ is a value greater than 1; it represents the factor of slowdown in the backward pass due to overlapping the backward pass with communication. The value of $\kappa$ at a given bandwidth largely depends on whether the method is all-reduce compatible or not. In the case of synchronous SGD, when we use ring-reduce, the time for communicating a bucket of size $s$ becomes [thakur2005optimization]:

$$T_{AR}(s, p, B) = 2(p-1)\,\alpha + \frac{2(p-1)}{p}\cdot\frac{s}{B} \qquad (1)$$

where $\alpha$ is the latency coefficient, $s$ is the bucket size, $p$ is the number of GPUs and $B$ is the amount of bandwidth available.
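To make the model concrete, the following sketch evaluates the expression above with the ring all-reduce cost of Equation 1; it reflects our reading of the model, and all constants passed in are placeholders the caller must supply.

```python
def ring_allreduce_time(size_bytes, p, alpha, bandwidth):
    """Equation (1): latency term plus bandwidth term of ring all-reduce."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) * size_bytes / (p * bandwidth)

def syncsgd_iteration_time(t_comp, kappa, n_buckets, bucket_bytes, last_bucket_bytes,
                           p, alpha, bandwidth):
    """Backward pass overlaps with the all-reduce of the first n_b - 1 buckets;
    only the last bucket's all-reduce is exposed."""
    overlapped = max(kappa * t_comp,
                     (n_buckets - 1) * ring_allreduce_time(bucket_bytes, p, alpha, bandwidth))
    return overlapped + ring_allreduce_time(last_bucket_bytes, p, alpha, bandwidth)
```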

(a) Performance model on syncSGD
(b) Performance model on PowerSGD
(c) Performance model on SignSGD
Figure 8: Evaluating our performance model in the real world: We evaluate our performance model on AWS on p3.8xlarge instances. We observe that our performance model quite closely tracks the actual performance of both the syncSGD implementation of PyTorch and the gradient compression methods. In our experiments we saw large fluctuations in the measured values; our error bars represent the standard deviation we observed. Before all experiments we measured the available pairwise bandwidth using iperf3 [iperf3], and calculated the latency term by performing all-reduce on a vector of size equal to the number of machines. Note: For BERT we could not scale signSGD beyond 32 GPUs, because signSGD's memory requirement increases linearly with the number of machines and for BERT we ran out of available memory.

4.2 Gradient Compression

From the perspective of performance, the scalability of a compression method depends on two main factors: (i) can the aggregation be performed using all-reduce, and (ii) does the method operate on the gradients of all layers together or on each layer separately; if the method can operate on each layer, then it can overlap communication and compression with the backward computation. Table 1 classifies a number of gradient compression methods along these two dimensions. Ideally, for high scalability we would like a method to be both all-reduce compatible and support layer-wise compression.

Next, we present a performance model for training when we use gradient compression. Assuming that we can overlap gradient compression and communication, a generic performance model is:

$$T_{total} = \max\big(\kappa \cdot T_{comp},\; T_{encdec} + T_{comm}(n_b - 1, s_b, p, B)\big) + T_{comm}(1, s_l, p, B)$$

where $T_{comp}$ is the time required for gradient computation, $T_{encdec}$ is the overhead of compressing and decompressing the gradients, and $T_{comm}(n_b - 1, s_b, p, B)$ is the time required to communicate the compressed gradients, with $n_b$ the number of buckets after gradient compression, $s_b$ the size of the buckets, and $p$ the number of GPUs. $T_{comm}(1, s_l, p, B)$ is the time to communicate the last unoverlapped bucket of size $s_l$. $\kappa$ represents the amount of slowdown due to performing encode-decode and communication in parallel with gradient computation.

We now derive specific performance models for different gradient compression schemes from the generic model stated above.

PowerSGD

Since PowerSGD drastically reduces the gradient size, it can send the data in a single bucket. However, PowerSGD requires sending two smaller matrices, namely P and Q, thus incurring twice the latency overhead. Moreover, performing the encode-decode operation separately on each layer leads to significant overhead, i.e., encode-decode is most efficient when performed on the full gradient vector (the gradients of all the layers). Taking into account the lack of benefits from overlap shown in the previous section, our performance model becomes:

$$T_{total} = T_{comp} + T_{encdec} + T_{AR}(|P|, p, B) + T_{AR}(|Q|, p, B)$$

where $p$ is the number of GPUs, $P$ and $Q$ are the low-rank matrices communicated by PowerSGD, and $T_{AR}$ can be calculated using Equation 1.

Top-K

For Top-k the output of compression is the top-k gradient values and their corresponding indices, and thus, similar to PowerSGD, this leads to twice the latency overhead. Further, since Top-k is not all-reducible, aggregation uses an all-gather, giving $T_{AG}(s, p, B) = (p-1)\,\alpha + (p-1)\cdot s/B$ for a payload of size $s = k \cdot n$, where $n$ is the gradient size and $p$ is the number of GPUs. A similar calculation applies to the indices. Overall the performance model becomes:

$$T_{total} = T_{comp} + T_{encdec} + 2\,T_{AG}(k \cdot n, p, B)$$

SignSGD

SignSGD only sends 1 bit for each 32-bit value, leading to around 32x gradient compression. However, SignSGD is not compatible with all-reduce, leading to a performance model as follows:

$$T_{total} = T_{comp} + T_{encdec} + T_{AG}\!\left(\tfrac{n}{32}, p, B\right)$$

where $T_{AG}(s, p, B) = (p-1)\,\alpha + (p-1)\cdot s/B$ and $n$ is the size of the uncompressed gradient.
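Under the stated assumption that these three methods are not overlapped with the backward pass, the models above can be written as small helpers on top of the ring all-reduce and all-gather costs. This is a sketch of our reading of the models, with all inputs (times, sizes) supplied by the caller.

```python
def ring_allreduce_time(size_bytes, p, alpha, bandwidth):
    # Equation (1)
    return 2 * (p - 1) * alpha + 2 * (p - 1) * size_bytes / (p * bandwidth)

def allgather_time(size_bytes, p, alpha, bandwidth):
    # Non-associative aggregation: every worker receives (p - 1) * size bytes
    return (p - 1) * alpha + (p - 1) * size_bytes / bandwidth

def powersgd_time(t_comp, t_encdec, p_bytes, q_bytes, p, alpha, bandwidth):
    """PowerSGD: compression is not overlapped; P and Q are all-reduced separately."""
    return (t_comp + t_encdec
            + ring_allreduce_time(p_bytes, p, alpha, bandwidth)
            + ring_allreduce_time(q_bytes, p, alpha, bandwidth))

def topk_time(t_comp, t_encdec, grad_bytes, k_fraction, p, alpha, bandwidth):
    """Top-k: values and indices are each all-gathered (twice the latency, linear in p)."""
    payload = k_fraction * grad_bytes
    return t_comp + t_encdec + 2 * allgather_time(payload, p, alpha, bandwidth)

def signsgd_time(t_comp, t_encdec, grad_bytes, p, alpha, bandwidth):
    """signSGD: 1 bit per 32-bit value (32x smaller), but aggregated via all-gather."""
    return t_comp + t_encdec + allgather_time(grad_bytes / 32, p, alpha, bandwidth)
```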

4.3 Methodology

Compression Method | Compression Parameter | Encode-decode time (ms)
PowerSGD | Rank-4 | 45
PowerSGD | Rank-8 | 64
PowerSGD | Rank-16 | 130
Top-k | 20% | 295
Top-k | 10% | 289
Top-k | 1% | 240
signSGD | - | 16.34
Table 2: Encode & decode times for ResNet-50 on 4 machines: even for a comparatively small network like ResNet-50, whose backward pass takes around 122ms, gradient compression methods have high overhead.

We first empirically verify our performance model with experiments on ResNet-50, ResNet-101, and BERT at several different batch sizes. We used 2, 8, 16, and 24 p3.8xlarge instances (up to 96 GPUs) on AWS. Before each run we measure the available bandwidth between each pair of instances using iperf3 [iperf3] and take the minimum of these values as B. For calculating the latency term α we perform ring-reduce on a small tensor and divide the obtained value by 2(p − 1), where p is the number of GPUs.

Since with synchronous SGD the backward pass and gradient synchronization are overlapped, we cannot calculate the backward time by using timers. To calculate κ for synchronous SGD we first measured the time for the backward pass on a single machine. Next we ran distributed training with Nsight Systems profiling switched on. From Nsight Systems we are able to track when kernels are launched during the backward pass and can find how long the backward pass takes. The ratio between the two allows us to calculate κ. For all our experiments we disable NCCL auto tuning and force it to use the ring algorithm by setting NCCL_TREE_THRESHOLD=0. As shown in Figure 8 we observe that our performance model is very close to the actual time observed. The median difference between our predictions and the actual measured runtime is 1.8% for syncSGD, 1.37% for PowerSGD and 14.2% for signSGD. We believe the reason for the higher difference for signSGD is that the AllGather collective has an all-to-one pattern which is known to cause degraded network performance due to issues like incast [chen2009understanding, alizadeh2010data]. Overall we find our performance model is sufficient to predict trends and compare various aspects of gradient compression.

Using the performance model

Having constructed and verified the performance model, it becomes quite easy to evaluate the performance of gradient compression under different bandwidths and training scenarios. The value of $T_{comp}$ depends on the hardware, the computation requirements of the model, and the batch size used for training. We measure the encode-decode times for SignSGD, Top-k and PowerSGD. We only include the computation time and disregard the time for extracting gradients or copying the decompressed gradients back into the model. We believe that these timings can be improved with tighter integration with the training frameworks. These values serve as $T_{encdec}$ in the different analyses. Table 2 shows the encode-decode times for ResNet-50 when using V100 GPUs on AWS. We omit similar measurements for BERT and ResNet-101 for brevity.

5 Towards Ideal Gradient Compression

(a) ResNet50: 64 GPUs
(b) ResNet101: 64 GPUs
(c) BERT: 64 GPUs
Figure 9: Required gradient compression for near optimal speedups: We observe that the compression required for near optimal scaling is quite small. At 10Gbps, even for quite small batch sizes, only a modest amount of gradient compression is needed, far less than what popular gradient compression methods provide.
(a) ResNet 50: Batch Size 64
(b) ResNet-101: Batch Size 64
(c) BERT: Batch Size 12
Figure 10: Difference between ideal speedup and observed training speed using optimized implementations of syncSGD: We observe that the difference between ideal scaling and syncSGD is quite small (less than 200 ms) even at 10Gbps. This small difference provides little opportunity for gradient compression methods to provide actual speedups.

Given the limitations of existing gradient compression schemes, we next discuss the design parameters that can be useful in designing an ideal gradient compression scheme, i.e., a scheme that will give linear speedup as we increase the number of machines. We first discuss how much compression is needed and then how much encode-decode time is available compared to synchronous SGD. For these studies, we use our performance model.

Compression for perfect scalability

We use our performance model to determine how much compression is required to achieve ideal scalability. For this analysis we disregard encode-decode time. We note that the ideal scaling scenario for weak scaling is for the time per iteration to stay constant as we add more machines. This happens when we can overlap most of the communication with computation. We simplify the performance model with the assumption that the whole communication phase can be overlapped and sent in one bucket, i.e., we do not send the last bucket of gradients separately after computation. From this we get the condition for ideal speedup:

$$T_{sync}(s_c, p, B) \le T_{comp}$$

where $T_{sync}(s_c, p, B)$ represents the time required to synchronize a compressed gradient of size $s_c$ among $p$ GPUs at bandwidth $B$. Assuming that the compression strategy is all-reducible, this allows us to calculate the ideal value of $s_c$ at a given bandwidth $B$ for a given number of GPUs.
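Solving the inequality above for the compressed size under the ring all-reduce cost of Equation 1 gives the largest gradient that can be fully hidden behind the backward pass. The sketch below does exactly that; the example inputs (300 ms backward pass, 64 GPUs, 10 Gbps, 5 µs latency) are assumptions for illustration, not measurements.

```python
def max_hideable_gradient_bytes(t_comp, p, alpha, bandwidth):
    """Largest compressed-gradient size s_c whose ring all-reduce fits inside the
    backward pass: solve 2(p-1)*alpha + 2(p-1)*s_c/(p*B) <= t_comp for s_c."""
    budget = t_comp - 2 * (p - 1) * alpha
    return max(0.0, budget * p * bandwidth / (2 * (p - 1)))

s_c = max_hideable_gradient_bytes(t_comp=0.3, p=64, alpha=5e-6, bandwidth=1.25e9)
# Required compression ratio is simply original_gradient_bytes / s_c.
print(round(s_c / 1e6, 1), "MB of (compressed) gradient can be fully overlapped")
```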

In Figure 9, for different batch sizes and models, we indicate the amount of gradient compression required to see almost linear scaling at 64 GPUs. From this we observe that even at 10Gbps we ideally need only a modest amount of gradient compression, even for extremely small batch sizes. For larger batch sizes at 10Gbps this requirement goes down even further; e.g., a large model like BERT requires only a small compression factor to achieve near linear scaling. We would like to note that our simplifying assumption that the whole of the gradient communication gets overlapped is unrealistic, since there will always be a part of the gradient that will only be available for aggregation once the gradient calculation is complete, thus leading to less than linear scalability. However, the values from our analysis give a very good ballpark estimate of how much compression is required.

Bounding Encoding time

Next we analyse how far synchronous SGD is from linear scaling. This provides an upper bound for how much time can be spent on encoding/decoding by a compression scheme, especially considering that they cannot be effectively overlapped (Section 3.1).

We observe in Figure 10 that the gap between the time taken by synchronous SGD and ideal scaling grows with model size. The difference is only around 50ms at 150 machines for ResNet-50 and grows to around 100ms for ResNet-101 and around 200ms for BERT. However, we also note that the gradient compression schemes need to compress a larger vector in the case of a larger model.

Takeaway: An ideal gradient compression scheme only needs to provide a modest amount of compression to enable linear speedup, but its encode-decode time needs to be bounded by roughly 50-200 ms (depending on the model) for it to be effective.

(a) ResNet50: Batch Size 64
(b) ResNet101: Batch Size 64
(c) BERT: Batch Size 12
Figure 11: Evaluating the effect of network bandwidth on training: We vary bandwidth availability and analyse the performance of synchronous SGD vs PowerSGD Rank-4. We observe that as bandwidth increases significantly it helps synchronous SGD, since it has a larger communication overhead. Moreover, we observe that PowerSGD provides massive gains at extremely low bandwidth (1Gbps), but as bandwidth scales PowerSGD becomes bounded by compute availability.

6 What-If Analysis

Our performance model also allows us to consider several what-if scenarios. To understand how and where gradient compression methods will be useful, we can vary several factors like compute power available, encode-decode time, network bandwidth etc. Based on our results in Section 3.2, we use PowerSGD with Rank-4 as the baseline for these what-if analyses.

Effect of Network Bandwidth

In Figure 11 we vary the network bandwidth available from 1Gbps to 30Gbps and see how this changes the speedup offered by PowerSGD. We see that, for example, in the case of ResNet-50, PowerSGD offers considerable speedup at low network bandwidths (1-3 Gbps) but becomes slower than synchronous SGD as the available bandwidth increases further. This is due to the fact that synchronous SGD benefits more from the availability of higher bandwidth since it communicates significantly more, while PowerSGD is still limited by the extra time spent in the encode-decode step. For BERT, which is a communication heavy network, PowerSGD becomes slower than synchronous SGD at 15 Gbps.
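This kind of what-if sweep is straightforward to script on top of the performance model. The snippet below uses ResNet-50-like inputs from the paper (122 ms backward pass, 97 MB of gradients, 45 ms Rank-4 encode-decode) together with assumed values for the final bucket size, the P/Q payload, and the latency term, so the exact numbers it prints are illustrative only.

```python
def ring_allreduce_time(size_bytes, p, alpha, bandwidth):
    return 2 * (p - 1) * alpha + 2 * (p - 1) * size_bytes / (p * bandwidth)

def syncsgd_time(t_comp, grad_bytes, p, alpha, bw, last_bucket=25e6):
    # All but an assumed 25 MB final bucket overlaps with the backward pass.
    overlapped = max(t_comp, ring_allreduce_time(grad_bytes - last_bucket, p, alpha, bw))
    return overlapped + ring_allreduce_time(last_bucket, p, alpha, bw)

def powersgd_time(t_comp, t_encdec, pq_bytes, p, alpha, bw):
    # Two all-reduces (P and Q), each on roughly half of the compressed payload.
    return t_comp + t_encdec + 2 * ring_allreduce_time(pq_bytes / 2, p, alpha, bw)

for gbps in (1, 3, 10, 25):
    bw = gbps * 1.25e8  # bytes per second
    print(f"{gbps:>2} Gbps  syncSGD: {syncsgd_time(0.122, 97e6, 64, 5e-6, bw):.3f}s"
          f"  PowerSGD r4: {powersgd_time(0.122, 0.045, 1e6, 64, 5e-6, bw):.3f}s")
```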

Effect of faster compute

Next we analyze how the effect of gradient compression changes, if newer hardware comes along which performs neural network training faster.

In Figure 12, we plot the effect of compute capabilities improving by up to several times, while the network bandwidth remains constant at 10 Gbps. We can see that for ResNet-50, PowerSGD with Rank-4 can provide a 1.75x speedup if the compute becomes around 3.5x faster.

There are two reasons for this, (i) As compute gets faster, the encode-decode time also reduces by the same factor, (ii) with a faster backward pass, there is less opportunity for synchronous SGD to overlap computation with communication, making it communication bound.

(a) ResNet50: Batch Size 64
(b) ResNet101: Batch Size 64
(c) BERT: Batch Size 12
Figure 12: Evaluating the effect of compute speedup on training time: Assuming network capacity remains at 10 Gbps but compute capabilities go up, we observe that PowerSGD ends up providing significant benefit, while synchronous SGD ends up being communication bound and is not able to utilize the increased compute. This shows that if compute capabilities increase drastically but network bandwidth remains stagnant, gradient compression methods will become useful.

Tradeoff between encode-decode time and compression ratio

Finally, we explore the tradeoff between reducing encode-decode time while simultaneously decreasing the compression ratio by a related proportion. For this we consider a hypothetical gradient compression scheme in which, if we decrease the encode-decode time by a factor of c, the size of the gradients communicated increases by a factor of c^m. For example, if c = 2 and m = 2, then a 2x decrease in encode-decode time would be accompanied by a 4x increase in the size of the gradients communicated. This setup is meant to study what would happen if we had compression schemes that offered a variety of trade-off points. We vary c from 1 to 4 in increments of 1 and try 1, 2 and 3 as values of m. Using PowerSGD with Rank-4 as the baseline, we see in Figure 13 that any reduction in encode-decode time, even at the expense of increased communication, helps.

Takeaway: Improvements in network bandwidth will make gradient compression less effective, whereas improvements in compute can make them more effective.

(a) ResNet50: Batch Size 128
(b) ResNet101: Batch Size 64
(c) BERT: Batch Size 12
Figure 13: Varying encode-decode time and compression: We observe that reducing encode-decode time, even if it leads to reduced gradient compression, is very useful and can make methods like PowerSGD more viable.

7 Discussion and Takeaways

Our results from the previous three sections indicate that existing gradient compression methods provide limited benefits for distributed learning in datacenter settings. In this section we summarize the key takeaways from our analysis and also discuss how our performance modeling approach can be useful for algorithm developers and users.

Better Gradient Compression Schemes:

We consistently observe that gradient compression methods which are all-reducible are much more scalable; thus aggregation functions in future designs should be associative. We also find that overheads in encoding and decoding gradients can lead to slowdowns. Thus, the focus can be shifted from achieving extremely high compression ratios to reducing encode-decode time.

What-if analysis for users:

Data scientists who build ML models are faced with an array of choices in terms of gradient compression algorithms to use. The performance model that we have developed can be used to determine whether a certain gradient compression method will be beneficial given a user's compute (e.g., GPU generation) and network (e.g., RDMA) setup. For example, in recent work [ramesh2021zeroshot], only after significant engineering effort were engineers able to use PowerSGD to speed up training of an extremely large model (12 billion parameters). Our performance model can enable users to reason about expected performance gains, thus helping in better decision making.

Workload trends

Highly scalable syncSGD implementations like PyTorch DDP and Horovod rely on the overlap between communication and the backward pass to provide high speedup. If the backward time reduces significantly but network bandwidth doesn't increase, then there will be fewer opportunities for overlap. In such cases communication will become a significant bottleneck and gradient compression methods like PowerSGD could provide significant speedups. Correspondingly, if workloads change such that new DNN operators have low compute density but a large number of parameters, then in those scenarios existing gradient compression methods can again provide speedups.

Extending to other scenarios:

Finally, while our focus in this paper was on gradient compression schemes, this approach can be extended to help users optimize other training options as well. For example, choosing an effective batch size is similarly challenging and a similar approach can be used. While we focus exclusively on performance in this study, data scientists typically want to choose the appropriate optimizations without sacrificing model accuracy. Developing methods that can reason about accuracy along with performance is an avenue for future work.

8 Related Work

Several prior studies have evaluated the efficacy of distributed training. MLPerf [mattson2019mlperf] and DawnBench [coleman2019analysis] are two well-known industry supported efforts to perform periodic benchmarking of training and inference speed at scale. TBD [tbd] and Daydream [daydream] profile DNNs to aid in determining appropriate low-level compute/memory optimizations. Recently there has been work on studying both theoretical and practical aspects of gradient compression. In [tang2020communication] the authors study several aspects of distributed learning and provide a comprehensive survey of both theoretical and practical aspects of distributed machine learning. Perhaps closest to our work is [zhang2020network], where the authors perform a detailed study of whether the network is the bottleneck in distributed training. Unlike [zhang2020network], our study focuses on the utility of gradient compression methods in different settings and analyzes other aspects beyond just the effect of network bandwidth. On the gradient compression side, previous studies include surveys of compression algorithms for use on edge devices [murshed2019machine].

Our approach of using a performance model to do what-if analysis has also been applied in other domains. Prior work has looked at modeling workloads [magpie] and developing tools for performance debugging [aguilera2003performance] of blackbox systems. In this paper we model gradient compression algorithms in terms of the computation and communication required. Prior work in big data analytics [ousterhout2015making] has also performed blocked-time analysis to determine the importance of network bandwidth, straggler mitigation etc. We use a similar approach but our synchronous SGD workload is simpler as it only involves computation and communication of gradients.

9 Conclusion

In this work we perform a detailed analysis of several gradient compression methods used to accelerate distributed ML training. We discover that existing gradient compression methods provide marginal speedups in a realistic setup due to the high overhead of performing compression. We develop a performance model that can be used to guide algorithm designers building scalable gradient compression algorithms. Moreover, our performance model allows us to conduct what-if analyses that can help users determine how much compression they need given improvements in network bandwidths. We believe this analysis will provide the community clarity on the desirable properties for gradient compression and will lead to methods which can provide improved speedups in the future.

Acknowledgements

Dimitris Papailiopoulos is supported by an NSF CAREER Award #1844951, two Sony Faculty Innovation Awards, an AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, and an NSF TRIPODS Award #1740707. Shivaram Venkataraman is supported by the National Science Foundation grant CNS-1838733, a Facebook faculty research award and by the Office of the Vice Chancellor for Research and Graduate Education at UW-Madison with funding from the Wisconsin Alumni Research Foundation.

References